4.1. Experimental Settings
Dataset. To validate the performance of the DA-FPN network structure proposed in this study, we performed all our experiments on the MS COCO [55] dataset. The MS COCO dataset has both MS COCO 2014 and MS COCO 2017 versions. Specifically, the method proposed in this paper was evaluated on the MS COCO 2017 dataset, which contains 80 object categories and 1.5 million object instances. In total, 80k images were used for training, 40k images for validation, and 20k images for testing. All models in this paper were trained on trainval35k. We then used another 5k images from the validation set for testing and visualization.
Metrics. The MS COCO dataset uses the AP metrics to characterize detector performance. Average precision (AP) is calculated across 10 IoU thresholds (i.e., 0.5:0.05:0.95) and all categories. AP is regarded as the most important metric for the MS COCO dataset and is also reported at various object scales: small objects (area < 32²), medium objects (32² < area < 96²), and large objects (area > 96²).
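As a simplified illustration of how the COCO-style metrics above are aggregated (a sketch of the conventions only, not the official pycocotools implementation), the ten IoU thresholds and the area-based scale buckets can be written as:

```python
# Sketch of COCO-style metric conventions: the primary AP averages
# per-threshold APs over IoU thresholds 0.50, 0.55, ..., 0.95, and
# objects are assigned to scale buckets by pixel area.

def coco_iou_thresholds():
    """The ten IoU thresholds 0.5:0.05:0.95 used for the primary AP metric."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]

def mean_ap(ap_per_threshold):
    """Average the per-threshold AP values into the single COCO AP number."""
    return sum(ap_per_threshold) / len(ap_per_threshold)

def scale_bucket(area):
    """Assign an object to the small/medium/large bucket by pixel area."""
    if area < 32 ** 2:
        return "small"    # evaluated as AP for small objects
    elif area < 96 ** 2:
        return "medium"   # evaluated as AP for medium objects
    return "large"        # evaluated as AP for large objects

print(coco_iou_thresholds())   # [0.5, 0.55, ..., 0.95]
print(scale_bucket(30 * 30))   # small
print(scale_bucket(100 * 100)) # large
```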
Implementation Details. All experiments in this paper are based on the MMDetection framework. We trained three network models, FoveaBox, GFL [56], and SABL, setting the initial learning rate to 0.0025 for FoveaBox and GFL and to 0.005 for SABL. The number of epochs was set to 12. Unless otherwise specified, the backbone feature extraction network was ResNet50 pre-trained on ImageNet. For all three models, we replaced the original FPN structure with our DA-FPN structure. We used Ubuntu 18.04 as the operating system, Python 3.7 as the programming language, PyTorch 1.7.0 as the deep learning framework, and two 10 GB NVIDIA 2080 Ti GPUs with CUDA 10.1 for training.
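As a hedged sketch of how such a neck swap is typically expressed in an MMDetection-style Python config (the module name 'DAFPN' and its registration are assumptions for illustration; only 'FPN' ships with MMDetection), the settings above might look like:

```python
# Hypothetical MMDetection-style config fragment: replacing the standard
# FPN neck with a custom DA-FPN neck. The type name 'DAFPN' is an assumed
# name for a custom registered module, not an MMDetection built-in.
model = dict(
    backbone=dict(
        type='ResNet',
        depth=50,
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    neck=dict(
        type='DAFPN',  # assumed custom neck; the stock choice is type='FPN'
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        num_outs=5))

# Training schedule matching the settings reported above (FoveaBox/GFL).
optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001)
runner = dict(type='EpochBasedRunner', max_epochs=12)
```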
4.2. Results
To verify the effectiveness and generality of our DA-FPN structure relative to the traditional FPN structure, in this section we compare our method with popular two-stage object detection algorithms such as Faster R-CNN [24] and Cascade R-CNN [25], as well as the single-stage object detection algorithms FCOS [35] and ATSS [36]. We also replace the FPN structures in the two-stage detector SABL and the single-stage detectors FoveaBox and GFL with the DA-FPN structure proposed in this paper. The experimental results are listed in
Table 1. The data in this table show that when ResNet50 is used as the backbone, our method improves the AP of the original FoveaBox, GFL, and SABL object detection algorithms by 1.7%, 2.4%, and 2.4%, respectively. Consistent gains of 2.2%, 2.2%, and 2%, and of 2%, 4.3%, and 3.2%, are also observed for the three detectors on the threshold- and scale-specific AP metrics reported in Table 1. When ResNet101 is used as the backbone for feature extraction, our method improves the AP of the original FoveaBox, GFL, and SABL algorithms by 1.5%, 2%, and 1.4%, respectively. These results illustrate that the proposed method can effectively alleviate the loss of detection accuracy caused by the insufficient extraction of object boundary information and shallow information in the traditional FPN.
4.3. Ablation Study
To verify that the deformable convolution module (DCM), feature alignment module (LFAM), and bottom-up module (BUM) in the proposed DA-FPN structure each improve the accuracy of the object detection algorithm, we performed ablation experiments on the single-stage detector GFL; the results are shown in Table 2. The table shows that the deformable convolution module alone improves detection accuracy by 1.2%, the largest improvement among the three modules. This is because replacing the 1 × 1 convolution with a 3 × 3 deformable convolution captures more boundary information of small objects and thus improves detection accuracy. In addition, the bottom-up module improves two of the scale-specific AP metrics in Table 2 by 1.6% and 2.8%, respectively, because it effectively fuses the shallow information of the object into the high-level feature maps, thereby improving the detection accuracy of large objects.
In the DA-FPN structure, we replaced the 1 × 1 convolution used in the FPN lateral connection with a 3 × 3 deformable convolution. There are two lateral connection locations in the traditional FPN structure: after the backbone feature maps and after the top-down feature maps. To determine where to place our 3 × 3 deformable convolution module, we performed corresponding experiments on the single-stage detector FoveaBox. The results are shown in Table 3. From these data, we find that replacing only the 1 × 1 convolution after the backbone feature maps with a 3 × 3 deformable convolution module improves detection accuracy by 0.5%, while accuracy remains unchanged when only the 1 × 1 convolution after the top-down feature maps is replaced. This occurs because the backbone feature maps are obtained from shallower layers through multiple convolution and pooling operations, losing a large amount of object boundary information along the information flow; a 3 × 3 deformable convolution is therefore needed to expand the receptive field and recover more boundary information before feature fusion. In the bottom-up module, by contrast, each feature map is obtained from the preceding one through only a few convolution operations, so little boundary information is lost, and replacing the 1 × 1 convolution with a 3 × 3 deformable convolution has little effect. Since the deformable convolution module adds some computation, only the 1 × 1 convolution after the backbone feature maps is replaced by the deformable convolution module in the DA-FPN structure.
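To make the receptive-field argument concrete, the following is a minimal pure-Python sketch of the deformable-convolution sampling idea (our illustration, not the paper's implementation): each tap of a 3 × 3 kernel is displaced by a learned offset and the input is read by bilinear interpolation, so the effective receptive field can extend beyond the fixed grid sampled by a plain convolution.

```python
# Minimal sketch of deformable sampling: a 3x3 kernel whose taps are
# displaced by per-tap offsets and read via bilinear interpolation.

def bilinear(feat, y, x):
    """Bilinearly interpolate a 2D feature map at fractional (y, x)."""
    h, w = len(feat), len(feat[0])
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bot = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bot * dy

def deformable_3x3(feat, cy, cx, weights, offsets):
    """One output value: sum over 9 taps at offset-shifted sample points."""
    out = 0.0
    k = 0
    for ky in (-1, 0, 1):
        for kx in (-1, 0, 1):
            oy, ox = offsets[k]  # learned per-tap displacement
            out += weights[k] * bilinear(feat, cy + ky + oy, cx + kx + ox)
            k += 1
    return out

feat = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
w = [1.0 / 9] * 9                # averaging kernel for illustration
zero = [(0.0, 0.0)] * 9          # zero offsets -> plain 3x3 convolution
shift = [(0.0, 0.5)] * 9         # all taps shifted right by half a pixel
print(deformable_3x3(feat, 1, 1, w, zero))   # 5.0: plain 3x3 average at (1,1)
print(deformable_3x3(feat, 1, 1, w, shift))  # 5.5: samples half a pixel right
```

In the real module the offsets are predicted per position by an extra convolution branch; in PyTorch this mechanism is available as `torchvision.ops.DeformConv2d`.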
We also added a feature alignment module (RFAM) to the bottom-up module. To verify its effectiveness, we performed the corresponding experiments on the single-stage detector GFL and the two-stage detector SABL; the results are shown in Table 4 and Table 5. Adding the feature alignment module to the bottom-up module improves the overall detection accuracy of GFL and SABL by 1.0% and 0.8%, respectively, and one of the scale-specific AP metrics by 1.9% and 2.1%, respectively. The reason is that retransmitting shallow information to the high-level feature maps through the bottom-up path involves a 2× downsampling operation, which causes feature misalignment during feature fusion. The information lost to this misalignment is especially important in two-stage object detection algorithms, where classification and regression of objects rely on the high-level feature maps.
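The misalignment caused by 2× downsampling can be illustrated with a small pure-Python example (our sketch, independent of the paper's RFAM implementation): stride-2 pooling followed by nearest-neighbor 2× upsampling shifts feature positions, which is why an alignment step is needed before element-wise fusion.

```python
# Sketch of the feature-misalignment problem: an activation spike at
# position 5 is downsampled by 2 (stride-2 max pooling) and upsampled
# by nearest neighbor; its apparent position moves by one cell.

def downsample2(x):
    """Stride-2 max pooling over a 1D feature vector."""
    return [max(x[i], x[i + 1]) for i in range(0, len(x) - 1, 2)]

def upsample2_nearest(x):
    """Nearest-neighbor 2x upsampling of a 1D feature vector."""
    out = []
    for v in x:
        out.extend([v, v])
    return out

feat = [0.0] * 8
feat[5] = 1.0                  # spike at index 5
round_trip = upsample2_nearest(downsample2(feat))
print(feat.index(1.0))         # 5
print(round_trip.index(1.0))   # 4: the spike has shifted by one cell
```

Fusing `feat` and `round_trip` element-wise would therefore mix features from neighboring spatial locations unless the maps are first realigned.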
Table 2 shows that among the three modules of DA-FPN, LFAM and BUM perform very well in the single-stage object detection algorithms. To confirm the generality of these two modules in two-stage object detection algorithms, we performed ablation experiments with them on the two-stage detector SABL. The results are shown in Table 6. These data indicate that the two modules are also effective in two-stage object detection algorithms, raising the overall detection accuracy of SABL from 40.0% to 41.9%, an increase of 1.9%. When only the feature alignment module was used, the small-object detection accuracy improved from 22.9% to 24.6%, an increase of 1.7%. This shows that feature misalignment also degrades small-object detection accuracy in two-stage object detection algorithms. When only the bottom-up module was used, the large-object detection accuracy improved from 52.0% to 54.5%, an increase of 2.5%. This shows that the bottom-up module effectively improves the detection of large objects by fusing information from the low-level feature maps into the high-level feature maps and reducing the loss of shallow information.
To verify whether the proposed method can better extract the boundary information of objects, we performed visualization experiments with FoveaBox. The results are shown in Figure 5, which demonstrates that the proposed method accurately extracts object boundary information and thus improves detection accuracy. For example, in the first column, the bounding box generated by FoveaBox does not completely enclose the truck because FoveaBox does not extract enough object boundary information, whereas the bounding box generated by our method encloses the truck completely. In the second column, FoveaBox fails to capture the boundary information of the zebra's ears and tail, while the proposed method extracts this boundary information accurately.
To visualize whether the proposed DA-FPN performs better than the traditional FPN structure, we conducted comparison experiments on both the single-stage detector GFL and the two-stage detector SABL; the results are visualized in Figure 6. The figure shows that DA-FPN can correct false detections and recover missed objects, especially small ones. For example, in the left image of the second row, the proposed GFL-DAFPN detector finds small objects such as umbrellas and potted plants that the GFL algorithm missed and correctly classifies some objects that GFL misclassified. In the middle image of the second row, the DA-FPN algorithm removes the tennis racket that GFL incorrectly localized. In the right image of the second row, it detects the missed traffic light and handbag and removes the incorrectly detected person. Similarly, in the left image of the fourth row, the SABL-DAFPN algorithm removes the ties that SABL incorrectly localized. In the middle image of the fourth row, SABL-DAFPN removes the objects that SABL misdetected as kites. In the right image of the fourth row, SABL-DAFPN detects the missed benches as well as the distant hot air balloons. This analysis shows that the proposed DA-FPN method can effectively correct false and missed detections and detect more small objects.