Article

Detector Consistency Research on Remote Sensing Object Detection

Shaanxi Key Laboratory for Network Computing and Security Technology, Department of Computer Science and Engineering, Xi’an University of Technology, No. 5 South Jinhua Road, Xi’an 710048, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(17), 4130; https://doi.org/10.3390/rs15174130
Submission received: 28 June 2023 / Revised: 29 July 2023 / Accepted: 19 August 2023 / Published: 23 August 2023
(This article belongs to the Section AI Remote Sensing)

Abstract

Remote sensing image (RSI) processing is a traditional research field, and RSI object detection is one of its most important directions. This paper focuses on an inherent problem of multi-stage object detection frameworks: the coupling error transmitting problem. In brief, because of the coupling method between the classifier and the regressor, traditional multi-stage detection frameworks tend to be fallible when encountering coarse object proposals. To deal with this problem, this article proposes a novel deep learning-based multi-stage object detection framework. Specifically, a novel network head architecture with a multi-to-one coupling method is proposed to avoid the coupling error of the traditional network head architecture. Moreover, it is found that the traditional network head architecture is more efficient than the novel network architecture when encountering fine object proposals. Considering this phenomenon, a proposal-consistent cooperation mechanism between the network heads is proposed. This mechanism lets the traditional network head and the novel network head develop each other's advantages and avoid each other's disadvantages. Experiments with different backbone networks on three publicly available data sets have shown the effectiveness of the proposed method, with mAP improved by 0.7% to 12.3% on most models and data sets.

1. Introduction

Remote sensing object detection is the process of determining the location and the category of objects in optical remote sensing images. In recent years, a large number of remote sensing object detection methods [1,2] have been proposed based on the deep learning technique. As an important and practical research field, the object detection task is not only used in the detection of ships [3], airports [4], vehicles [5] and other objects, but is also widely used in object tracking [6], instance segmentation [7], caption generation [8] and many other fields. A detector consists of a position regressor and a category classifier, and the coupling between the two has always been a concern in the object detection task.
Recently, many object detection methods have been proposed. Among these methods, the mainstream ones are based on deep learning. As shown in Figure 1a, deep learning-based methods can be roughly divided into the following steps: feature extraction, region proposal, Region of Interest Pooling (RoIP), classification and regression. The region proposal step pre-generates regions where objects may exist. The RoIP samples features in the pre-generated regions.
The above methods use a single network head (including classification and regression) to detect objects. The singularity of the network head makes the framework lack efficiency in both classification and regression [9]. In order to deal with this problem, many studies have been made on multi-stage object detection, which is defined as using multiple cascaded network heads to improve the accuracy of the bounding boxes. The framework of the multi-stage object detection is shown in Figure 1b. Some examples are shown as follows. Li et al. [9] proposed a group recursive learning network consisting of three cascaded network heads: weakly supervised object segmentation, object proposal generation and recursive detection refinement. Cai et al. [10] proposed cascade R-CNN with different Intersection over Union (IoU) thresholds to deal with the insufficient training samples problem on the later network heads.
However, the current multi-stage methods have an inherent problem: the coupling error is transmitted along multiple network heads [10,11,12]. In the aforementioned methods, the traditional classification-to-regression coupling is one-to-one exact matching. First, this one-to-one exact matching is likely to transmit classification errors to regression. As shown in Figure 1a,b, the network finally adopts the regressor corresponding to the classification result. Secondly, the false regression will make the input proposal of the RoIP deviate from the true object. Therefore, the RoIP cannot sample useful features, which further leads to a classification error in the next network head [11]. In other words, the regression error is transmitted to the next network head. Finally, the above two kinds of errors are repeated between multiple network heads, resulting in errors in the final detection result. Apparently, this problem causes iterative error transmission across the multi-stage detection heads and urgently needs to be addressed.
In order to overcome the coupling error transmitting problem, the Consistent Multi-stage Detection (CMD) framework is proposed. As shown in Figure 1c, the proposed CMD framework consists of the following parts.
First, the proposed method introduces the concepts of a robust coupling head and an efficient coupling head for coarse boxes and fine boxes, respectively. In contrast, prior works only used the efficient coupling head. Second, the proposed method adopts coupling mechanisms that are consistent with the change in boxes during multiple detection stages. The boxes tend to change from coarse to fine, so the adopted coupling mechanism also keeps this trend. Thus, the proposed model is a consistent method.
In summary, the main contributions of our work are as follows.
  • Concepts of a robust head and efficient head are proposed. Through various experiments, the functions of two kinds of coupling methods are validated. The robust coupling method is helpful for avoiding detection errors for coarse proposals. The efficient coupling method usually achieves better performance on fine proposals but worse performance on coarse proposals.
  • Fineness-consistent multi-head cooperation mechanisms are investigated between the robust coupling head and the efficient coupling head. These cooperation mechanisms are designed to be consistent with the coarse-to-fine trend of the object bounding boxes during the multi-stage detection process.
  • A novel network head architecture, consistent multi-stage detection, is proposed to deal with the coupling error transmitting problem by adopting an appropriate multi-head cooperation mechanism. Experiments with different backbone networks on three widely used remote sensing object detection data sets have shown the effectiveness of the proposed framework.
The rest of this paper is organized as follows. In Section 2, some existing object detection methods are illustrated. Section 3 describes the proposed method in detail. Finally, the experimental results and discussion are reported in Section 4, while the conclusion is made in Section 5.

2. Related Works

This section reviews major object detection methods that have made significant contributions to the Remote Sensing Image (RSI) object detection task. Taking the implementation of deep learning as a milestone, these RSI object detection methods are divided into two categories: traditional manual methods and neural network methods.

2.1. Traditional Manual Methods

The traditional manual methods take manually designed features for classification and localization. These traditional manual methods have made great progress and are divided into several series according to the main thoughts.

2.1.1. Low-Level Feature Methods

SIFT-based methods:
The first series is based on the Scale Invariant Feature Transform (SIFT) [13] feature descriptor, which extracts features of several selected points from a given image and compares them with features of selected points from a known object. SIFT-based methods are robust to object rotation, scaling and panning. Sedaghat et al. [14] proposed an improved SIFT to address the feature distribution problem of SIFT in multi-source RSIs. Li et al. [15] proposed scale-orientation join restriction criteria for better feature-matching performance among object key points in RSIs. This method enables SIFT to be robust to the scaling problem that is common in RSIs.
HOG-based methods:
The next series is based on the Histogram of Oriented Gradients (HOG) [16]. HOG features represent objects with orientation distributions and intensity distributions of an object’s spatial region gradient vectors. Tuermer et al. [17] proposed an integrated real-time processing chain using HOG features to classify regions in RSI vehicle detection. Grabner et al. [5] proposed an efficient online boosting algorithm for RSI vehicle detection based on the research on HOG. Cheng et al. [18] proposed an RSI object detection framework using a collection of part detectors (COPD), which uses HOG as low-level features.

2.1.2. Mid-Level Feature Methods

BoVW-based methods:
Another series is based on the Bag of Visual Words (BoVW) [19] model. The BoVW is an unsupervised recognition method that represents images by a collection of local regions. Later, Xu et al. [20] applied the BoVW model to object-based RSI classification. Sun et al. [21] proposed a sparse coding and BoVW-based rotation invariant method to deal with the complex shape problem of RSI object detection. Cheng et al. [22] proposed a BoVW and pLSA-based scene classification method to detect landslides from RSIs. Xia et al. [23] proposed an active clustering method to annotate RSIs with little expert knowledge.
SC-based methods:
The final series is based on Sparse Coding (SC). The main idea of SC is to represent the high-dimensional original data with a low-dimensional manifold that contains several structural primitives. Zhang et al. [24] proposed a Sparse Representation-Based Binary Hypothesis (SRBBH) for RSI object detection. Yokoya et al. [25] integrated local-feature sparse coding into a generalized Hough transform. Zhang et al. [26] proposed Sparse Transfer Manifold Embedding (STME), which can extract discriminative features from limited samples. Han et al. [27] proposed a multi-class RSI object detection method integrating visual saliency modeling and sparse coding.

2.2. Neural Network Methods

The neural network methods refer to the methods based on deep learning features. Compared with traditional manual methods, neural network methods involve little human interference [18]. The recent developments of deep learning-based object detection methods are organized into one-stage methods, two-stage methods and multi-stage methods.

2.2.1. One-Stage Methods

YOLO-v3 [28] is a state-of-the-art method in terms of its lightweight and high-speed characteristics. YOLOF [29] analyzed the success of FPN and proposed an efficient single-level feature method. SSD [30] added a pyramid feature hierarchy to the YOLO idea; that is, it predicts targets on feature maps with different receptive fields.

2.2.2. Two-Stage Methods

Girshick et al. [31] first proposed an object detection framework using R-CNN based on convolutional neural networks. To improve the speed of this model, Girshick proposed Fast R-CNN [32] and co-authored with Ren et al. to propose Faster R-CNN [33]. Cheng et al. [34] proposed a rotation-invariant and Fisher-discriminative CNN model to address object rotation. Liu et al. [35] proposed a network with oriented response modules to deal with small objects. Zhao et al. [36] proposed a Multi-scale Image block-level Fully Convolutional Neural Network (MIF-CNN) to deal with the fast detection in RSIs. Lu et al. [37] proposed a Gated and Axis-Concentrated Localization Network (GACL Net) to address small-object localization. Long et al. [38] proposed a Feature Fusion Deep Network (FFDN) to deal with small, partially obscured or out-of-view objects. Zhang et al. [39] proposed a Multi-Scale Feature Fusion Network (MS-FF Net) to address scale variation.

2.2.3. Multi-Stage Methods

Different from previous object detection methods, which mainly use a single network head, multi-stage object detection frameworks with multiple cascaded network heads are investigated. Li et al. [9] proposed a group recursive learning network consisting of three cascaded networks: weakly supervised object segmentation, object proposal generation and recursive detection refinement. Cai et al. [10] proposed a detection framework that consists of multiple cascaded detectors with different IoU thresholds.

3. Proposed Method

To deal with the coupling error transmitting problem, the CMD framework is proposed with a robust network head to keep the network heads of different stages consistent with the coarseness of their input proposals. In the following subsections, the CMD framework is illustrated from four aspects: the entire framework, the robust network head architecture, the proposal-consistent heads cooperation mechanism and the training loss function.

3.1. Entire Framework

As shown in Figure 2, the proposed CMD framework consists of several parts. First, CNN is used to extract hierarchical deep features of the input image. Secondly, the Feature Pyramid Network (FPN) is used to integrate layers with hierarchical semantic levels and different spatial scales. Thirdly, the Region Proposal Network (RPN) is used to generate proposals, which are expected to contain objects. Finally, several detection stages are implemented iteratively to improve detection accuracy. Each detection stage consists of two parts: the RoIP to sample features of proposals from the FPN features of the input image and the network head to give the detection results. There are two types of network heads, one of which is proposed in this article. Moreover, a proposal-consistent head cooperation mechanism between the two types of network heads is proposed.

3.1.1. CNN

As shown in Figure 2, CNN is the first part of the CMD framework. At training time, cascaded convolutional layers of the CNN are trained to learn effective features. At inference time, the learned weights of the convolutional layers are used to extract features of the whole image. In detail, given an input image or a feature map $I = \{I_s\}_{s=1}^{S}$ with $S$ channels, each convolutional layer computes an output $O = \{O_c\}_{c=1}^{C}$ with $C$ channels by

$$O_c = K_c(I) = \sum_{s=1}^{S} \kappa_{cs} \otimes I_s + b_c$$

where $K_c$ represents the convolutional layer with weights $[\kappa_{c1}, \kappa_{c2}, \ldots, \kappa_{cs}, \ldots, \kappa_{cS}]$ and bias $b_c$. The symbol $\otimes$ represents the convolution operation.
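For illustration, a minimal PyTorch sketch of such a convolutional layer is given below; the channel numbers and input size are placeholders rather than the settings used in this framework.

```python
import torch
import torch.nn as nn

# Minimal sketch of the convolutional layer K in the equation above:
# each output channel O_c sums kappa_cs (convolved with) I_s over the S input channels, plus a bias b_c.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)  # S=3, C=64 are illustrative

image = torch.randn(1, 3, 224, 224)   # a dummy input image I with S=3 channels
features = conv(image)                # output O with C=64 channels
print(features.shape)                 # torch.Size([1, 64, 224, 224])
```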

3.1.2. FPN

FPN is adopted in our framework to ensure feature effectiveness for object detection. As shown in Figure 2, the FPN starts with several deep-layer CNN features. The FPN features on the right side of the dashed box are element-wise addition results of feature maps from two different layers. Before adding, the shallow-layer feature is passed through a 1 × 1 convolutional layer, and the deep-layer feature is upsampled to the same spatial size as the shallow layer. Finally, several integrated feature maps with different spatial scales are output for the rest of the framework.
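The following sketch illustrates one such merge step; the channel numbers and spatial sizes are illustrative and do not reproduce the exact FPN configuration used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of one FPN merge step: a 1x1 lateral convolution on the shallow (high-resolution)
# feature, upsampling of the deep (low-resolution) feature, then element-wise addition.
# The channel numbers (512 -> 256) are illustrative, not the paper's exact configuration.
lateral = nn.Conv2d(512, 256, kernel_size=1)

shallow = torch.randn(1, 512, 100, 168)   # shallow CNN feature map
deep = torch.randn(1, 256, 50, 84)        # deeper FPN feature map (already 256 channels)

merged = lateral(shallow) + F.interpolate(deep, size=shallow.shape[-2:], mode="nearest")
print(merged.shape)  # torch.Size([1, 256, 100, 168])
```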

3.1.3. RPN

Given a feature map of the image, anchors are built on each pixel with different width–height ratios and areas. Then a classifier is used to predict whether there is an object in an anchor. At the same time, a regressor is used to predict the bounding box coordinate regression that brings the anchor closer to the object, if there is one. Since more than one feature map is generated by the FPN, the CMD framework contains as many RPNs as there are FPN feature maps.
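A simplified sketch of this anchor generation is given below; the stride, anchor areas and the convention that the ratio means height over width are placeholder assumptions, not the settings of the proposed framework.

```python
import itertools
import torch

def make_anchors(feat_h, feat_w, stride, areas=(32 ** 2,), ratios=(0.5, 1.0, 2.0)):
    """Sketch of RPN anchor generation: anchors with different areas and height/width
    ratios are centred on every feature-map pixel, in (x1, y1, x2, y2) image coordinates."""
    boxes = []
    for y, x in itertools.product(range(feat_h), range(feat_w)):
        cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
        for area, ratio in itertools.product(areas, ratios):
            w, h = (area / ratio) ** 0.5, (area * ratio) ** 0.5   # here ratio = h / w
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return torch.tensor(boxes)

anchors = make_anchors(feat_h=50, feat_w=84, stride=16)
print(anchors.shape)  # torch.Size([12600, 4]) = 50 * 84 * 3 anchors
```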

3.1.4. RoIP

RoIP receives proposals from the RPNs or the network heads and samples features of the proposals from the FPN features of the whole image. In this way, features of different scales are sampled for each single proposal, making the following detection with the network heads more accurate.
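As an illustration, the following sketch samples fixed-size proposal features with torchvision's roi_align, one common RoIP implementation; the feature-map stride and the proposal boxes are invented for the example.

```python
import torch
from torchvision.ops import roi_align

# Sketch of RoI feature sampling with roi_align (an RoIP variant): a fixed-size feature map
# is cropped from an FPN level for each proposal box.
fpn_level = torch.randn(1, 256, 100, 168)                   # one FPN feature map, stride 8 (illustrative)
proposals = torch.tensor([[0, 48.0, 32.0, 160.0, 120.0],    # (batch_idx, x1, y1, x2, y2) in image coords
                          [0, 300.0, 200.0, 420.0, 310.0]])

roi_feats = roi_align(fpn_level, proposals, output_size=(7, 7), spatial_scale=1.0 / 8)
print(roi_feats.shape)  # torch.Size([2, 256, 7, 7]) -- one 7x7 feature map per proposal
```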

3.1.5. Network Head

A network head refers to a pair consisting of a regressor and a classifier. The network head receives the feature map of a proposal from the RoIP. Then the regressor takes this proposal feature map to predict the bounding box coordinate regression, and the classifier takes it to predict the proposal category. Apparently, the proposal may not perfectly contain the object, causing a possible classification error.

3.2. Robust Network Head Architecture

3.2.1. Robust Network Head

As described in Section 1, the traditional network head is likely to cause the coupling error transmitting problem. Therefore, a network head with a novel coupling method that can avoid this error transmission is highly demanded.
As shown in Figure 3a, the traditional efficient network head adopts the category prediction to decide which element of the regression list to use. This coupling method is likely to convey classification errors to the regression branch. In order to deal with this coupling problem, a robust network head is proposed, as shown in Figure 3b, which only has a single general regressor for all categories.
This robust network head is helpful to improve object detection performance for coarse input proposals whose bounding box is not close enough to the true object. The reason is explained as follows. First, during the inference time, the single regressor will not be affected by the classifier of the same stage. The regression results are related only to the convolutional proposal features and the regressor itself. Second, in the robust network head, the single regressor is trained with samples of different classes. Therefore, the network head can robustly improve the location accuracy of different class-bounding boxes. Finally, through blocking both of the detection error broadcasting paths, namely broadcasting through the regression branch and the classification branch, this robust network head is able to improve the model performance.
However, the robust network head does not perform efficiently enough on fine input proposals whose bounding box is close to the true object. The reason is explained as follows. First, with fine input proposals, the traditional regressor of the traditional efficient network head is less affected by the coupling error. Secondly, because of the multi-to-one coupling method between the general regressor and the classifier, the robust network head cannot regress specifically for the shape of each category.
For relatively coarse input proposals, experiments (see Section 4) proved that the robust network head shows better performance than the traditional network head.

3.2.2. Efficient Network Head

Although the robust network head can avoid the coupling error well, it still has some limitations. Therefore, an analysis of the traditional network head is also needed.
This efficient network head is more efficient than the robust network head for fine input proposals. The reasons are listed as follows. First, under the precondition of relatively fine input proposals, a current stage classifier is relatively reliable. Thus the coupling error, which is caused by classifiers misleading the regressor through a one-to-one coupling method, can be less important for detection performance. Second, considering that different categories of objects have various fine spatial characteristics, the traditional regressor in the efficient network head can specifically adjust to these various fine spatial characteristics of different categories. In other words, the general regressor in the robust network head has to consider fine spatial characteristics of different category objects and thus cannot be sensitive to differences among different categories. In summary, with the fine input proposals, coupling error can be ignored, and the traditional regressor can adjust to different categories. Therefore, the traditional efficient network head is more appropriate for fine input proposals.
For relatively fine input proposals, experiments (see Section 4) proved that the efficient network head shows better performance than the robust network head.
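To make the two coupling styles concrete, the following PyTorch sketch contrasts a class-agnostic (robust) regression branch with a class-specific (efficient) one; the feature dimension, class count and the convention that the background channel is last are assumptions for the example, not the exact head design used in this work.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Sketch of the two coupling styles: the efficient head predicts one regression per class
    (4 * num_classes outputs) and keeps only the regression of the predicted class, while the
    robust head uses a single class-agnostic regressor (4 outputs) shared by all classes."""

    def __init__(self, in_dim=1024, num_classes=20, robust=True):
        super().__init__()
        self.robust = robust
        self.cls_fc = nn.Linear(in_dim, num_classes + 1)               # classes + background (assumed last)
        self.reg_fc = nn.Linear(in_dim, 4 if robust else 4 * num_classes)

    def forward(self, roi_feat):
        scores = self.cls_fc(roi_feat)                                 # (N, num_classes + 1)
        deltas = self.reg_fc(roi_feat)
        if not self.robust:
            # efficient head: select the regression branch of the arg-max object class (the coupling)
            n, cls = roi_feat.size(0), scores[:, :-1].argmax(dim=1)
            deltas = deltas.view(n, -1, 4)[torch.arange(n), cls]
        return scores, deltas                                          # deltas: (N, 4)

feats = torch.randn(8, 1024)
robust_head, efficient_head = DetectionHead(robust=True), DetectionHead(robust=False)
print(robust_head(feats)[1].shape, efficient_head(feats)[1].shape)    # both torch.Size([8, 4])
```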

3.3. Proposal-Consistent Heads Cooperation Mechanism

According to Section 3.2, the robust network head and the efficient network head are respectively suitable for coarse input proposals and fine input proposals. However, as shown in Figure 4a, the traditional multi-stage detection adopts efficient network heads for all stages, neglecting the different roughnesses of different stages’ input proposals. Apparently, there should be a cooperation mechanism between the robust network head and the efficient network head to find the optimal detection performance. Thus, in this section, a proposal-consistent head cooperation mechanism is proposed.

3.3.1. Coarse-to-Fine Proposals

In multi-stage detection frameworks, proposals grow from coarse to fine. Through a detection network head, the majority of input proposals are optimized [10]. Therefore, after several detection network heads, the majority of the positive proposals are closer to the real objects. In summary, there is a coarse-to-fine trend of the proposals among different stages of the multi-stage detection frameworks.

3.3.2. Heads Cooperation Mechanism

As shown in Figure 4b, in accordance with the coarse-to-fine trend of the proposals, robust network heads are implemented before efficient network heads in our framework. At the very start, the proposals are still too coarse for the efficient network heads, so robust network heads are adopted to avoid the coupling error transmitting problem. Then, once the proposals are fine enough, efficient network heads are used to provide better performance through category-sensitive regression. In summary, the proposed method adopts a network head cooperation mechanism that is consistent with the coarse-to-fine trend of the proposals at different stages.
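For reference, an abridged MMDetection-style configuration sketch of such an RRF schedule is shown below; it uses the common reg_class_agnostic switch of cascaded bounding-box heads and omits required fields (input channels, bounding-box coders, losses), so it only indicates how the schedule could be expressed, not the authors' exact configuration.

```python
# Abridged, hedged configuration sketch of the RRF schedule: three cascaded bbox heads where the
# first two regress class-agnostically (the "robust" coupling) and the last regresses per class
# (the "efficient" coupling). Many required fields are omitted for brevity.
roi_head = dict(
    type='CascadeRoIHead',
    num_stages=3,
    bbox_head=[
        dict(type='Shared2FCBBoxHead', num_classes=20, reg_class_agnostic=True),   # R: robust head
        dict(type='Shared2FCBBoxHead', num_classes=20, reg_class_agnostic=True),   # R: robust head
        dict(type='Shared2FCBBoxHead', num_classes=20, reg_class_agnostic=False),  # F: efficient head
    ],
)
```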

3.3.3. Comparison and Advantages

In the traditional multi-stage detection framework, the efficient network heads at the early stages cause serious coupling errors on the still-coarse proposals. As shown in Figure 4a, these coupling errors make a part of the positive proposals falsely regress to new positions containing no real objects. These falsely regressed proposals are then classified by the next-stage efficient network head as negative samples, which are never considered again in the rest of the stages. As a result, positive proposals are finally lost.
In the CMD framework, the robust network heads at the early stages effectively avoid coupling errors, although they are not efficient enough in optimizing proposals. As shown in Figure 4b, most of the positive proposals are properly regressed to new positions containing real objects. The efficient network heads are then adopted to efficiently optimize these fine proposals to the real objects.
Through the comparison, it can be found that the proposal-consistent head cooperation mechanism can make better use of the advantages of the two network head architectures to make up for each other’s disadvantages. With this well-designed detection framework, CMD, the coupling error transmitting problem can be well addressed.

3.4. Training Loss Function

Since the multi-stage detection framework consists of several stages, each of which contains a classifier and a regressor, the loss function of the proposed framework is a hierarchical weighted sum of several losses.

3.4.1. Classification Loss

First, the category label and the category prediction of a bounding box are, respectively, defined as $p^* = \{p_c^* \mid c = 0, 1, 2, \ldots, C\}$ and $p = \{p_c \mid c = 0, 1, 2, \ldots, C\}$. Second, $\mathrm{CrossEntropyLoss}_i$ represents the cross-entropy loss for one object proposal out of $I$ object proposals:

$$\mathrm{CrossEntropyLoss}_i(p^*, p) = -\sum_{c=1}^{C} p_c^* \log(p_c)$$

Finally, the classification loss function of the $k$-th stage is represented as:

$$Loss_{cls}^{k} = \frac{1}{I} \sum_{i=1}^{I} \mathrm{CrossEntropyLoss}_i$$

3.4.2. Regression Loss

First, the regression label of a bounding box is defined as $t^* = \{t_l^* \mid l = x, y, w, h\}$. Meanwhile, the regression prediction has two forms: $t_{rb}$ under the robust network head and $T_{ef} = \{t_c \mid c = 0, 1, 2, \ldots, C\}$ under the efficient network head. Second, $\mathrm{SmoothL1Loss}_i$ represents the regression loss function of $smooth_{L1}$ [32] for one object proposal out of $I$ object proposals:

$$\mathrm{SmoothL1Loss}_i = \begin{cases} smooth_{L1}(t^* - t_{rb}), & \text{robust network head} \\ \sum_{c=1}^{C} \sigma_c \, smooth_{L1}(t^* - t_c), & \text{efficient network head} \end{cases}$$

where $\sigma_c$ is formulated as follows:

$$\sigma_c = \begin{cases} 1, & c = \arg\max(p) \\ 0, & \text{otherwise} \end{cases}$$

Finally, the regression loss function of the $k$-th stage is represented as:

$$Loss_{reg}^{k} = \frac{1}{I} \sum_{i=1}^{I} \mathrm{SmoothL1Loss}_i$$
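A small sketch of this per-proposal regression loss, including the $\sigma_c$ selection used by the efficient head, is given below; the tensor shapes and class count are illustrative.

```python
import torch
import torch.nn.functional as F

# Sketch of the per-proposal regression loss: the robust head compares its single prediction
# t_rb with the label t*, while the efficient head only uses the regression branch of the
# arg-max class (the sigma_c selector).
def regression_loss(t_star, t_rb=None, t_ef=None, cls_scores=None):
    if t_rb is not None:                       # robust head: one (N, 4) prediction
        return F.smooth_l1_loss(t_rb, t_star, reduction='mean')
    # efficient head: t_ef has shape (N, C, 4); select the predicted class per proposal
    idx = cls_scores.argmax(dim=1)             # sigma_c = 1 for c = argmax(p), 0 otherwise
    selected = t_ef[torch.arange(t_ef.size(0)), idx]
    return F.smooth_l1_loss(selected, t_star, reduction='mean')

t_star = torch.randn(8, 4)
print(regression_loss(t_star, t_rb=torch.randn(8, 4)))
print(regression_loss(t_star, t_ef=torch.randn(8, 5, 4), cls_scores=torch.randn(8, 5)))
```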

3.4.3. CMD Loss

While $\tau$ represents a fixed loss weighting parameter of the regression loss for all stages and $\lambda_k$ represents the loss weighting parameter of stage $k$, the final loss function is formulated as:

$$CMD\_Loss = \sum_{k} \lambda_k \left( L_{cls}^{k} + \tau L_{reg}^{k} \right)$$
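The following sketch assembles the overall CMD loss from per-stage classification and regression terms; the stage weights $\lambda_k$ and the factor $\tau$ below are placeholder values, not the trained settings.

```python
import torch
import torch.nn.functional as F

# Sketch of the overall CMD loss: a weighted sum of per-stage classification and regression
# losses. The weights lambdas and tau are illustrative placeholders.
def cmd_loss(stage_outputs, lambdas=(1.0, 0.5, 0.25), tau=1.0):
    """stage_outputs: list of (cls_scores, labels, reg_preds, reg_targets), one tuple per stage."""
    total = 0.0
    for lam, (scores, labels, preds, targets) in zip(lambdas, stage_outputs):
        l_cls = F.cross_entropy(scores, labels)
        l_reg = F.smooth_l1_loss(preds, targets)
        total = total + lam * (l_cls + tau * l_reg)
    return total

outputs = [(torch.randn(8, 21), torch.randint(0, 21, (8,)), torch.randn(8, 4), torch.randn(8, 4))
           for _ in range(3)]
print(cmd_loss(outputs))
```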

4. Results and Analysis

In this section, we sequentially introduce our experiments from four different aspects: data set description, evaluation metrics, implementation details and experimental results. Details of these parts are illustrated as follows.

4.1. Data Set Description

To evaluate our localization model, experiments are implemented on three remote sensing object detection data sets: DIOR [40], HRRSD [41] and NWPU VHR-10 [18]. Details of these data sets are introduced below.

4.1.1. DIOR

The DIOR data set is the most recently proposed remote sensing object detection data set, which is a large-scale benchmark data set. This data set contains 23,463 optical remote sensing images. In total, 190,288 instances of 20 class objects are distributed in these images.
Some examples of DIOR are shown in Figure 5. In this data set, objects of the same class have different sizes, which can increase the difficulty of detection [40]. Moreover, object instance numbers and class numbers are abundant. Therefore, DIOR is a large-scale and hard data set.

4.1.2. HRRSD

The HRRSD data set is a balanced remote sensing object detection data set, which means that object instances of different classes have similar quantities. Moreover, this data set is also a large-scale data set containing 21,761 optical remote sensing images. In total, 55,740 instances of 13 class objects are distributed in these images.
Some examples of HRRSD are shown in Figure 6. In this data set, the object numbers are balanced for different classes. This balanced data set makes the training samples of each class sufficient [41]. Moreover, this data set contains a large number of object instances. Therefore, the HRRSD is a large-scale and balanced data set.

4.1.3. NWPU VHR-10

The NWPU VHR-10 data set is a historical remote sensing object detection data set, which was proposed in 2014. This data set contains 800 optical remote sensing images. In total, 800 instances of 10 class objects are distributed in these images.
Some examples of NWPU VHR-10 are shown in Figure 7. NWPU VHR-10 is a small-scale data set, which was one of the earliest remote sensing object detection data sets.

4.2. Evaluation Metrics

In order to quantitatively validate the efficiency of the proposed method, we adopt the most-used evaluation metrics in object detection: Average Precision (AP) for each class and mean AP (mAP) for each data set. Moreover, Intersection over Union (IoU) is used for location evaluation of objects.

4.2.1. AP and mAP

To clearly explain AP, it is necessary to introduce the confusion matrix. The confusion matrix is shown in Table 1. As shown in this table, TP represents the true positive object that is correctly predicted as positive. Similarly, FP represents the false positive object that is falsely predicted as positive, FN represents the false negative object that is falsely predicted as negative and TN represents the true negative object that is correctly predicted as negative. Then formulas for precision and recall are represented as:
$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$
For a single object category, a Precision-Recall Curve (PRC) is drawn. For PRC, the vertical axis and the horizontal axis are the P r e c i s i o n and R e c a l l , respectively. Since both the P r e c i s i o n and the R e c a l l are smaller than one, the area under PRC is represented as the Average Precision (AP), a synthetic measurement considering both the P r e c i s i o n and the R e c a l l .
After the illustration of AP, it is easy to understand that the mean AP (mAP) is the mean value of the APs of all categories in a data set.
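A minimal sketch of computing AP as the area under the precision–recall curve is given below; the toy detections are invented, and a full evaluation would additionally match predictions to ground truth by IoU per class.

```python
import numpy as np

# Sketch of AP for one class: detections are sorted by confidence, precision/recall are
# accumulated, and the area under the precision-recall curve is integrated.
def average_precision(scores, is_true_positive, num_gt):
    order = np.argsort(-np.asarray(scores))
    hits = np.asarray(is_true_positive, dtype=float)[order]
    tp, fp = np.cumsum(hits), np.cumsum(1.0 - hits)
    recall = tp / num_gt
    precision = tp / (tp + fp)
    return float(np.trapz(precision, recall))      # area under the PR curve

ap = average_precision(scores=[0.9, 0.8, 0.7, 0.6], is_true_positive=[1, 0, 1, 1], num_gt=4)
print(round(ap, 3))
# mAP is then the mean of the per-class APs over all categories in the data set.
```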

4.2.2. IoU

For a single predicted box, both the category and the location are used for the assessment of detection results. In this evaluation process, the location accuracy of an object is measured by IoU.
While I n t and U n i represent the intersection area and the union area between the predicted bounding box and the label bounding box, respectively, IoU is represented as
$$IoU = \frac{Int}{Uni}$$
To be noted, if a predicted bounding box does not have an intersection area with any label-bounding boxes, the predicted bounding box is viewed as background.
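For completeness, a short sketch of the IoU computation for two axis-aligned boxes is given below; the example boxes are illustrative.

```python
def iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes: intersection area divided by union area.
    A predicted box with no intersection gives IoU = 0."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175 ≈ 0.143
```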

4.3. Implementation Details

In this part, the implementation details of the experiments are illustrated from four aspects: experimental environment, preprocessing, training parameters and CMD parameters.

4.3.1. Experimental Environment

The following experiments are conducted on a server with seven NVIDIA Titan X GPUs. A comprehensive toolbox named MMDetection [42] is used in our experiments, with a software environment of CUDA 10.1, CUDNN 7.6.3, gcc-4.8, g++-4.9 and PyTorch 1.2.

4.3.2. Preprocessing

The images are first resized to (1333, 800) and padded so that each side is a multiple of 32. Moreover, half of the randomly selected training examples are flipped. Finally, the images are normalized with statistical information, including the average values and standard deviations for the three channels (red, green and blue) of the current data set.
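A hedged sketch of this preprocessing pipeline is given below; the normalization statistics are illustrative placeholders rather than the per-data-set values actually used.

```python
import torch
import torch.nn.functional as F

# Sketch of the preprocessing described above: resize towards (1333, 800) keeping the aspect
# ratio, optionally flip, normalize per channel, and pad each side up to a multiple of 32.
def preprocess(img, mean=(123.7, 116.3, 103.5), std=(58.4, 57.1, 57.4), flip=False):
    c, h, w = img.shape                                          # img: (3, H, W) float tensor
    scale = min(1333 / max(h, w), 800 / min(h, w))               # keep aspect ratio within (1333, 800)
    img = F.interpolate(img[None], scale_factor=scale, mode='bilinear', align_corners=False)[0]
    if flip:
        img = torch.flip(img, dims=[2])                          # horizontal flip
    img = (img - torch.tensor(mean).view(3, 1, 1)) / torch.tensor(std).view(3, 1, 1)
    pad_h, pad_w = (32 - img.shape[1] % 32) % 32, (32 - img.shape[2] % 32) % 32
    return F.pad(img, (0, pad_w, 0, pad_h))                      # pad to a size divisible by 32

out = preprocess(torch.rand(3, 600, 900) * 255, flip=True)
print(out.shape)  # spatial dims are multiples of 32
```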

4.3.3. Training Parameters

First of all, the models are trained rather than fine-tuned. The training is implemented with a batch size of 16 for 12 epochs. In each epoch, all samples of the training set are used only once. Second, the stochastic gradient descent optimizer is set with a learning rate of 0.02, a weight decay of $10^{-4}$ and a momentum of 0.9. Third, the RPNs are trained on samples excluding the ground-truth bounding boxes themselves, which would have IoU scores of 100%. All the training samples are obtained with anchors generated on each pixel, with aspect ratios of 0.5, 1.0 and 2.0. Finally, 256 anchors are randomly selected for RPN training, and 512 RPN proposals are randomly selected for detector training.

4.3.4. CMD Parameters

Since the CMD contains multiple detectors, the training of CMD involves the IoU settings of each detector. In the proposed method, the IoU of each detector is set to 0.5 in accordance with most current detectors.

4.4. Experimental Results

In this part, validation experiments are first implemented to show the function of robust detectors and fine detectors. After that, the CMD-based framework is compared with other methods in the contrast experiments.

4.4.1. Validation Experiments

To validate the function of robust detector and fine detector, CMDs with different structures are investigated on three data sets. In all the validation experiments, each of the different CMDs consists of three cascaded detectors. Each detector can be a robust detector, represented as “R”, or a fine detector, represented as “F”. According to the selection and order of “R” and “F”, four different structures of CMDs are implemented: CMD-RRR, CMD-RRF, CMD-RFF and CMD-FFF.
  • DIOR Validation: As shown in Table 2, the four types of CMDs are implemented on the DIOR data set. CMD-RRR, with three robust detectors, reaches the highest mAP of 60.0%. CMD-FFF, with three fine detectors, achieves the lowest mAP of 58.4%. Moreover, CMDs with more robust detectors tend to have better performance.
  • HRRSD Validation: As shown in Table 3, the four types of CMDs are implemented on the HRRSD data set. CMD-RRF, with two robust detectors and one fine detector, reaches the highest mAP of 88.7%. CMD-RFF, with one robust detector and two fine detectors, reaches the same mAP as CMD-RRF. CMD-FFF, with three fine detectors, reaches the lowest mAP of 88.1%.
  • NWPU VHR-10 Validation: As shown in Table 4, the four types of CMDs are implemented on the NWPU VHR-10 data set. CMD-RRR, with three robust detectors, reaches the highest mAP of 82.3%. CMD-FFF, with three fine detectors, reaches the lowest mAP of 72.3%. Moreover, CMDs with more robust detectors tend to have better performance.
Table 2. Validation Experiments on DIOR (%).

Class | Sp | ST | Sd | Bg | Vc | Ap | TC | WM | Dm | GF | BF
CMD-RRR | 73.3 | 65.1 | 37.7 | 33.8 | 44.7 | 66.5 | 81.3 | 75.6 | 46.2 | 71.1 | 63.6
CMD-RRF | 74.2 | 69.6 | 44.2 | 34.1 | 44.9 | 63.5 | 80.4 | 73.1 | 44.5 | 71.0 | 64.5
CMD-RFF | 73.6 | 64.6 | 42.3 | 34.3 | 44.1 | 62.0 | 80.4 | 75.1 | 43.9 | 69.2 | 64.3
CMD-FFF | 73.4 | 64.1 | 41.1 | 32.2 | 42.6 | 58.4 | 80.9 | 74.0 | 41.7 | 68.4 | 65.3

Class | BC | GTF | ESA | Hb | ETS | Op | Cn | Ar | TS | mAP
CMD-RRR | 85.4 | 74.6 | 52.5 | 44.6 | 45.2 | 52.5 | 75.9 | 61.6 | 49.7 | 60.0
CMD-RRF | 86.1 | 72.1 | 52.5 | 45.0 | 46.9 | 52.6 | 76.3 | 58.2 | 49.0 | 59.9
CMD-RFF | 86.7 | 72.5 | 51.6 | 43.4 | 46.5 | 52.8 | 75.7 | 59.7 | 48.6 | 59.6
CMD-FFF | 84.9 | 69.6 | 49.0 | 40.8 | 46.7 | 50.7 | 77.0 | 61.9 | 44.8 | 58.4
Best performance of each class is in bold. Sp: ship, ST: storage tank, Sd: stadium, Bg: bridge, Vc: vehicle, Ap: airport, TC: tennis court, WM: windmill, Dm: dam, GF: golf field, BF: baseball field, BC: basketball court, GTF: ground track field, ESA: expressway service area, Hb: harbor, ETS: expressway toll station, Op: overpass, Cn: chimney, Ar: airplane, TS: train station.
Table 3. Validation Experiments on HRRSD (%).

Class | Sp | Bg | GTF | ST | BC | TC | Ar | BD | Hb | Vc | CR | TJ | PL | mAP
CMD-RRR | 91.5 | 87.3 | 97.7 | 95.7 | 68.2 | 93.1 | 98.1 | 91.5 | 93.7 | 94.8 | 92.5 | 77.4 | 66.7 | 88.3
CMD-RRF | 92.8 | 88.2 | 97.9 | 95.9 | 69.3 | 93.0 | 98.5 | 90.9 | 94.2 | 92.9 | 92.4 | 77.7 | 67.9 | 88.7
CMD-RFF | 91.6 | 88.3 | 97.7 | 95.5 | 69.3 | 93.7 | 98.5 | 91.0 | 95.1 | 95.0 | 93.2 | 78.1 | 66.8 | 88.7
CMD-FFF | 92.2 | 86.4 | 97.5 | 95.3 | 68.8 | 93.3 | 98.6 | 90.7 | 94.4 | 95.7 | 92.0 | 75.5 | 65.5 | 88.1
Best performance of each class is in bold. Sp: ship, Bg: bridge, GTF: ground track field, ST: storage tank, BC: basketball court, TC: tennis court, Ar: airplane, Bd: baseball diamond, Hb: harbor, Vc: vehicle, CR: cross-road, TJ: T-junction, PL: parking lot.
Table 4. Validation Experiments on NWPU VHR-10 (%).

Class | Sp | Bg | GTF | ST | BC | TC | Ar | BD | Hb | Vc | mAP
CMD-RRR | 93.3 | 42.8 | 85.3 | 95.4 | 72.4 | 80.6 | 99.4 | 96.8 | 69.7 | 87.0 | 82.3
CMD-RRF | 89.9 | 24.5 | 76.0 | 96.6 | 50.8 | 79.1 | 99.2 | 96.6 | 58.4 | 85.4 | 75.7
CMD-RFF | 90.9 | 37.2 | 70.1 | 96.3 | 71.5 | 77.7 | 99.6 | 97.2 | 60.3 | 86.2 | 78.7
CMD-FFF | 84.2 | 19.2 | 54.3 | 95.7 | 62.2 | 75.3 | 99.1 | 96.4 | 51.4 | 85.5 | 72.3
Best performance of each class is in bold. Sp: ship, Bg: bridge, GTF: ground track field, ST: storage tank, BC: basketball court, TC: tennis court, Ar: airplane, Bd: baseball diamond, Hb: harbor, Vc: vehicle.
According to these results, we can find two clues for the analysis of the robust detector and the fine detector. Among the four CMDs, CMD-FFF always has the worst performance on different data sets, CMD-RRF and CMD-RFF reach the best performance on HRRSD and CMD-RRR reaches the best performance on DIOR and NWPU VHR-10. Details of the analyses are illustrated as follows.
First, among the four CMDs, CMD-FFF always achieves the worst performance on different data sets. Obviously, CMD-RFF can always surpass CMD-FFF; in other words, replacing the first fine detector in CMD-FFF with a robust detector can steadily improve model performance. This phenomenon implies that the robust detector works better in the earlier stages, with relatively coarse bounding boxes as input.
Second, among the four CMDs, CMD-RRF and CMD-RFF reach the best performance on HRRSD. To be noted, the CMD-FFF mAP on HRRSD reaches a high score of nearly 90%. This means that the RPN trained on HRRSD, a balanced large-scale data set, provides proposals fine enough to use three fine detectors. Under this condition, CMD-RRF or CMD-RFF shows better performance than CMD-RRR. This phenomenon implies that the fine detector works better in the later stages, with relatively fine bounding boxes.
Third, among the four CMDs, CMD-RRR reaches the best performance on DIOR and NWPU VHR-10. To be noted, the CMD-FFF mAPs on DIOR and NWPU VHR-10 cannot surpass 75%. This means that the RPNs trained on DIOR, a hard large-scale data set, and NWPU VHR-10, a small-scale data set, provide proposals too coarse to use three fine detectors. Under this condition, CMD-RRR shows better performance than CMD-RRF or CMD-RFF. Compared with validation experiments on HRRSD, where RPN is trained to provide fine proposals, we can find that more robust detectors are preferred if coarse proposals are provided, and more fine detectors are preferred if fine proposals are provided. This conclusion can also be inferred from the performance (mAP) order of CMDs in Table 2 and Table 4: CMD-RRR > CMD-RRF > CMD-RFF > CMD-FFF.

4.4.2. Ablation Experiments

To evaluate model effectiveness, ablation experiments are conducted. In Exp. 2, the experiment without (w/o) the RFF schedule denotes a normal three-stage network with normal heads, i.e., the FFF schedule. Then, in Exps. 3 and 4, the number of cascaded detection heads is reduced. As shown in Table 5, the proposed method shows better performance than the ablation experiments, Exps. 2 to 4. This validates the effectiveness of the proposed model and schedule.

4.4.3. Contrast Experiments

To evaluate the proposed CMD-based object detection framework, some other state-of-the-art frameworks are adopted as contrast methods, the majority of which are illustrated as follows:
  • R-CNN [31]: The Region-based Convolutional Neural Network (R-CNN) is the first deep feature-based object detection framework, which consists of several stages: proposal generation, region cropping, feature extraction, classification and regression.
  • RICNN [43]: Rotation-Invariant Convolutional Neural Network (RICNN) proposed a new rotation-invariant layer with the Alexnet CNN as the backbone network.
  • FasterR-CNN_r101 [33]: Faster R-CNN is proposed on the base of R-CNN and can simultaneously improve both the speed and accuracy of detection through the optimization and integration of different parts. This framework takes the 101-layer ResNet [44] for feature extraction.
  • FasterR-CNN_r50/_r101+FPN: These are compound object detection frameworks from [42]. These frameworks take the ResNet [44] with 50 or 101 layers for feature extraction and adopt the Feature Pyramid Network (FPN) [45] to enhance the extracted features.
  • YOLO-v3 [28]: YOLO-v3 is an incremental improvement of YOLO and a state-of-the-art method in terms of its lightweight and high-speed characteristics.
  • YOLOF [29]: The You Only Look One-level Feature analyzed the success of FPN and proposed an efficient single-level feature method.
  • Dynamic R-CNN [46]: Dynamic R-CNN, which targets high-quality object detection via dynamic training, designs a dynamic adjustment scheme for model parameters.
After introducing the contrast methods, the proposed methods are represented with the form of FasterR-CNN_r{N} + FPN + CMD - {D}, where N and D are optional from the sets of {50, 101} and {RRR, RFF}, respectively. Contrast experiments between the proposed methods and contrast methods on different data sets are conducted as follows.
  • DIOR Contrast: As shown in Table 6, different methods are implemented on the DIOR data set. The proposed CMD-RRR on FasterR-CNN_r101+FPN reaches the best performance, with a mAP of 61.6%. The proposed CMD-RRR improves the mAP on FasterR-CNN_r50+FPN by 1.9% and on FasterR-CNN_r101+FPN by 1.1%. The proposed CMD-RFF improves the mAP on FasterR-CNN_r50+FPN by 1.5% and on FasterR-CNN_r101+FPN by 0.8%. Apparently, CMD-RRR shows better performance than CMD-RFF.
  • HRRSD Contrast: As shown in Table 7, different methods are implemented on the HRRSD data set. The proposed CMD-RFF on FasterR-CNN_r101+FPN reaches the second-best performance, with a mAP of 90.3%. The proposed CMD-RRR improves the mAP on FasterR-CNN_r50+FPN by 2.3% and on FasterR-CNN_r101+FPN by −1.6%. The proposed CMD-RFF improves the mAP on FasterR-CNN_r50+FPN by 2.7% and on FasterR-CNN_r101+FPN by −0.9%. Apparently, CMD-RFF shows better performance than CMD-RRR.
  • NWPU VHR-10 Contrast: As shown in Table 8, different methods are implemented on the NWPU VHR-10 data set. The proposed CMD-RRR on FasterR-CNN_r101+FPN reaches the best performance, with a mAP of 87.4%. The proposed CMD-RRR improves the mAP on FasterR-CNN_r50+FPN by 12.3% and on FasterR-CNN_r101+FPN by 4.9%. The proposed CMD-RFF improves the mAP on FasterR-CNN_r50+FPN by 8.7% and on FasterR-CNN_r101+FPN by 3.0%. Apparently, CMD-RRR shows better performance than CMD-RFF.
Table 6. Contrast Experiments on DIOR (%).

Class | Sp | ST | Sd | Bg | Vc | Ap | TC | WM | Dm | GF | BF
R-CNN [31] | 9.1 | 18.0 | 60.8 | 15.6 | 9.1 | 43.0 | 54.0 | 16.4 | 33.7 | 50.1 | 53.8
RICNN [43] | 9.1 | 19.1 | 61.1 | 25.3 | 11.4 | 61.0 | 63.5 | 31.5 | 41.1 | 55.9 | 60.1
FasterR-CNN [33] | 27.7 | 39.8 | 73.0 | 28.0 | 23.6 | 49.3 | 75.2 | 45.4 | 62.3 | 68.0 | 78.8
RIFD-CNN [47] | 31.7 | 41.5 | 73.6 | 29.0 | 28.5 | 53.2 | 79.5 | 46.9 | 63.1 | 68.9 | 79.9
SSD [30] | 59.2 | 46.6 | 61.0 | 29.7 | 27.4 | 72.7 | 76.3 | 65.7 | 56.6 | 65.3 | 72.4
YOLO-v3 [28] | 87.4 | 68.7 | 70.6 | 31.2 | 48.3 | 29.2 | 87.3 | 78.7 | 26.9 | 31.1 | 74.0
FasterR-CNN_r50+FPN [42] | 73.3 | 63.2 | 47.8 | 32.0 | 43.2 | 54.2 | 80.5 | 74.4 | 37.1 | 65.6 | 66.0
FasterR-CNN_r50+FPN+CMD-RRR | 73.3 | 65.1 | 37.7 | 33.8 | 44.7 | 66.5 | 81.3 | 75.6 | 46.2 | 71.1 | 63.6
FasterR-CNN_r50+FPN+CMD-RFF | 73.6 | 64.6 | 42.3 | 34.3 | 44.1 | 62.0 | 80.4 | 75.1 | 43.9 | 69.2 | 64.3
FasterR-CNN_r101+FPN [42] | 72.7 | 61.6 | 51.6 | 34.9 | 43.0 | 61.5 | 80.0 | 75.4 | 48.9 | 71.0 | 64.7
FasterR-CNN_r101+FPN+CMD-RRR | 72.9 | 63.3 | 46.8 | 35.9 | 44.3 | 70.5 | 81.4 | 74.4 | 52.2 | 73.5 | 64.5
FasterR-CNN_r101+FPN+CMD-RFF | 72.9 | 62.4 | 41.5 | 37.1 | 44.2 | 67.4 | 80.3 | 75.6 | 52.7 | 74.4 | 64.3

Class | BC | GTF | ESA | Hb | ETS | Op | Cn | Ar | TS | mAP
R-CNN [31] | 62.3 | 49.3 | 50.2 | 39.5 | 33.5 | 30.9 | 53.7 | 35.6 | 36.1 | 37.7
RICNN [43] | 66.3 | 58.9 | 51.7 | 43.5 | 36.6 | 39.0 | 63.3 | 39.1 | 46.1 | 44.2
FasterR-CNN [33] | 66.2 | 56.9 | 69.0 | 50.2 | 55.2 | 50.1 | 70.9 | 53.6 | 38.6 | 54.1
RIFD-CNN [47] | 69.0 | 62.4 | 69.0 | 51.2 | 56.0 | 51.1 | 71.5 | 56.6 | 40.1 | 56.1
SSD [30] | 75.7 | 68.6 | 63.5 | 49.4 | 53.1 | 48.1 | 65.8 | 59.5 | 55.1 | 58.6
YOLO-v3 [28] | 78.6 | 61.1 | 48.6 | 44.9 | 54.4 | 49.7 | 69.7 | 72.2 | 29.4 | 57.1
FasterR-CNN_r50+FPN [42] | 85.5 | 70.8 | 49.6 | 38.0 | 46.5 | 49.9 | 76.9 | 62.6 | 44.2 | 58.1
FasterR-CNN_r50+FPN+CMD-RRR | 85.4 | 74.6 | 52.5 | 44.6 | 45.2 | 52.5 | 75.9 | 61.6 | 49.7 | 60.0
FasterR-CNN_r50+FPN+CMD-RFF | 86.7 | 72.5 | 51.6 | 43.4 | 46.5 | 52.8 | 75.7 | 59.7 | 48.6 | 59.6
FasterR-CNN_r101+FPN [42] | 86.0 | 72.0 | 53.6 | 42.8 | 50.0 | 52.5 | 77.8 | 58.0 | 52.0 | 60.5
FasterR-CNN_r101+FPN+CMD-RRR | 85.3 | 73.7 | 55.0 | 46.2 | 48.8 | 54.3 | 76.7 | 55.4 | 57.5 | 61.6
FasterR-CNN_r101+FPN+CMD-RFF | 86.0 | 71.7 | 54.3 | 47.0 | 51.5 | 55.0 | 77.1 | 56.2 | 53.6 | 61.3
Best performance of each class is in bold. Sp: ship, ST: storage tank, Sd: stadium, Bg: bridge, Vc: vehicle, Ap: airport, TC: tennis court, WM: windmill, Dm: dam, GF: golf field, BF: baseball field, BC: basketball court, GTF: ground track field, ESA: expressway service area, Hb: harbor, ETS: expressway toll station, Op: overpass, Cn: chimney, Ar: airplane, TS: train station.
Table 7. Contrast Experiments on HRRSD (%).

Class | Sp | Bg | GTF | ST | BC | TC | Ar
R-CNN [31] | 49.7 | 20.0 | 76.3 | 79.1 | 18.0 | 70.8 | 77.5
RICNN [43] | 56.5 | 27.4 | 78.0 | 81.0 | 23.0 | 66.4 | 78.1
FasterR-CNN [33] | 88.5 | 85.5 | 90.6 | 88.7 | 47.9 | 80.7 | 90.8
YOLO-v3 [28] | 83.7 | 88.1 | 96.1 | 92.9 | 53.5 | 87.1 | 96.7
YOLOF [29] | 90.1 | 89.2 | 96.7 | 90.6 | 61.0 | 89.9 | 97.3
Dynamic R-CNN [46] | 88.9 | 89.2 | 97.2 | 93.4 | 69.0 | 92.7 | 96.9
FasterR-CNN_r50+FPN [42] | 89.8 | 86.2 | 97.3 | 94.5 | 60.4 | 89.1 | 98.7
FasterR-CNN_r50+FPN+CMD-RRR | 91.5 | 87.3 | 97.7 | 95.7 | 68.2 | 93.1 | 98.1
FasterR-CNN_r50+FPN+CMD-RFF | 91.6 | 88.3 | 97.7 | 95.5 | 69.3 | 93.7 | 98.5
FasterR-CNN_r101+FPN [42] | 94.4 | 89.6 | 97.9 | 96.5 | 77.6 | 96.4 | 97.8
FasterR-CNN_r101+FPN+CMD-RRR | 92.9 | 89.8 | 98.4 | 95.6 | 70.6 | 93.1 | 98.4
FasterR-CNN_r101+FPN+CMD-RFF | 93.2 | 90.4 | 98.4 | 95.9 | 71.6 | 93.7 | 98.1

Class | BD | Hb | Vc | CR | TJ | PL | mAP
R-CNN [31] | 57.6 | 54.1 | 41.3 | 25.9 | 2.4 | 16.6 | 45.3
RICNN [43] | 59.6 | 47.8 | 52.0 | 26.6 | 9.3 | 20.5 | 48.2
FasterR-CNN [33] | 86.9 | 89.4 | 84.0 | 88.6 | 75.1 | 63.3 | 81.5
YOLO-v3 [28] | 89.1 | 95.0 | 81.1 | 91.5 | 75.0 | 53.1 | 83.3
YOLOF [29] | 91.8 | 95.0 | 87.0 | 93.2 | 79.7 | 63.8 | 86.6
Dynamic R-CNN [46] | 93.1 | 95.4 | 91.4 | 92.7 | 76.6 | 64.0 | 87.7
FasterR-CNN_r50+FPN [42] | 90.4 | 94.3 | 92.6 | 90.2 | 71.9 | 63.0 | 86.0
FasterR-CNN_r50+FPN+CMD-RRR | 91.5 | 93.7 | 94.8 | 92.5 | 77.4 | 66.7 | 88.3
FasterR-CNN_r50+FPN+CMD-RFF | 91.0 | 95.1 | 95.0 | 93.2 | 78.1 | 66.8 | 88.7
FasterR-CNN_r101+FPN [42] | 93.8 | 96.0 | 96.1 | 92.2 | 84.0 | 72.8 | 91.2
FasterR-CNN_r101+FPN+CMD-RRR | 91.8 | 94.3 | 95.3 | 93.1 | 81.1 | 70.3 | 89.6
FasterR-CNN_r101+FPN+CMD-RFF | 93.3 | 95.8 | 95.6 | 93.5 | 83.2 | 71.5 | 90.3
Best performance of each class is in bold. Sp: ship, Bg: bridge, GTF: ground track field, ST: storage tank, BC: basketball court, TC: tennis court, Ar: airplane, Bd: baseball diamond, Hb: harbor, Vc: vehicle, CR: cross-road, TJ: T-junction, PL: parking lot.
Table 8. Contrast Experiments on NWPU (%).

Class | Sp | Bg | GTF | ST | BC | TC | Ar | BD | Hb | Vc | mAP
R-CNN [31] | 63.7 | 45.4 | 81.2 | 84.3 | 46.8 | 35.5 | 70.1 | 83.6 | 62.3 | 44.8 | 61.8
RICNN [43] | 77.3 | 61.5 | 86.7 | 85.3 | 58.5 | 40.8 | 88.4 | 88.1 | 68.6 | 71.1 | 72.6
FasterR-CNN [33] | 89.9 | 80.9 | 100.0 | 67.3 | 87.5 | 78.6 | 90.7 | 89.2 | 89.8 | 88.0 | 86.2
OneshotDet-r50 [48] | 90.4 | 53.1 | 97.0 | 96.9 | 41.3 | 9.5 | 7.3 | 2.9 | 74.8 | 56.0 | 52.92
CoAE-r50 [49] | 87.0 | 79.23 | 100.0 | 88.2 | 75.2 | 4.1 | 9.1 | 19.47 | 94.5 | 77.86 | 63.47
SCoDANet-r50 [50] | 90.0 | 86.2 | 99.8 | 90.8 | 79.7 | 6.75 | 18.5 | 16.81 | 87.1 | 77.66 | 65.33
FasterR-CNN_r50+FPN [42] | 84.3 | 23.6 | 50.1 | 96.5 | 61.5 | 73.3 | 97.2 | 95.7 | 37.3 | 83.4 | 70.3
FasterR-CNN_r50+FPN+CMD-RRR | 93.3 | 42.8 | 85.3 | 95.4 | 72.4 | 80.6 | 99.4 | 96.8 | 69.7 | 87.0 | 82.3
FasterR-CNN_r50+FPN+CMD-RFF | 90.9 | 37.2 | 70.1 | 96.3 | 71.5 | 77.7 | 99.6 | 97.2 | 60.3 | 86.2 | 78.7
FasterR-CNN_r101+FPN [42] | 91.4 | 45.4 | 79.1 | 96.1 | 76.0 | 76.1 | 99.7 | 97.5 | 76.0 | 87.3 | 82.5
FasterR-CNN_r101+FPN+CMD-RRR | 94.0 | 53.1 | 92.8 | 96.8 | 80.5 | 81.9 | 99.9 | 97.8 | 85.9 | 91.1 | 87.4
FasterR-CNN_r101+FPN+CMD-RFF | 92.1 | 56.6 | 83.8 | 97.3 | 80.0 | 80.2 | 99.7 | 97.5 | 79.3 | 88.5 | 85.5
Best performance of each class is in bold. Sp: ship, Bg: bridge, GTF: ground track field, ST: storage tank, BC: basketball court, TC: tennis court, Ar: airplane, Bd: baseball diamond, Hb: harbor, Vc: vehicle.
According to these experiments, it is found that the proposed method can steadily improve the detection performance on different models and most data sets. Moreover, the proposed method reaches state-of-the-art performance on different data sets. Therefore, the proposed method is an effective method for the remote sensing object detection task.

4.5. Failure Case Analysis

As shown in Table 2 and Table 3, the proposed schedules are not always effective. In other words, the FFF schedule sometimes reaches the best performance compared with the other schedules. To be specific, failure cases include the categories {baseball field, chimney, airplane and train station} in DIOR and {airplane, harbor and vehicle} in HRRSD. The robust head 'R' can more easily handle low-quality proposals, but its accuracy is lower when the proposals are more precise. The input proposals for the failure-case categories are generally finer, thus the accuracy after using the robust head is lower in these cases. More studies are to be conducted in the future.

5. Conclusions

In this article, novel network detector schedules on multi-stage convolutional neural network frameworks are built for object detection in remote sensing images. First, the robust detector and the fine detector are carefully defined to build the CMDs. Then, different network detector schedules are investigated to give suggestions for using robust and fine detectors. The conclusion of this investigation can be easily migrated to various other models. Experiments with different baselines on different data sets are conducted to show that the robust detector and the fine detector are appropriate for coarse and fine proposals, respectively. Finally, the proposed frameworks are compared with several state-of-the-art methods. These abundant and convincing experiments have shown the following: First, the robust detector and the fine detector are, respectively, appropriate for coarse proposals and fine proposals. Second, the order and the numbers of the two types of detectors should be designed in accordance with the coarse-to-fine degree of the input proposals at different stages. Finally, the CMD-based framework can steadily improve the model performance with various backbone networks on different data sets.

Author Contributions

Conceptualization, Y.Z.; methodology, Y.Z.; software, Y.Z.; validation, Y.Z. and H.J.; formal analysis, Y.Z.; investigation, Y.Z.; resources, Y.Z.; data curation, Y.Z.; writing—original draft preparation, Y.Z.; writing—review and editing, Y.Z. and H.J.; visualization, Y.Z.; supervision, Y.Z.; project administration, Y.Z.; funding acquisition, Y.Z. and H.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Natural Science Foundation of China under Grants 62201472 and 62272383.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank the editors and the anonymous reviewers for their comments and suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Xu, H.; Zheng, W.; Liu, F.; Li, P.; Wang, R. Unmanned Aerial Vehicle Perspective Small Target Recognition Algorithm Based on Improved YOLOv5. Remote Sens. 2023, 15, 3583. [Google Scholar] [CrossRef]
  2. Körez, A.; Barışçı, N.; Çetin, A.; Ergün, U. Weighted ensemble object detection with optimized coefficients for remote sensing images. ISPRS Int. J. Geo-Inf. 2020, 9, 370. [Google Scholar] [CrossRef]
  3. Tang, J.; Deng, C.; Huang, G.B.; Zhao, B. Compressed-domain ship detection on spaceborne optical image using deep neural network and extreme learning machine. IEEE Trans. Geosci. Remote Sens. (TGRS) 2014, 53, 1174–1185. [Google Scholar] [CrossRef]
  4. Chen, F.; Ren, R.; Van de Voorde, T.; Xu, W.; Zhou, G.; Zhou, Y. Fast automatic airport detection in remote sensing images using convolutional neural networks. Remote Sens. 2018, 10, 443. [Google Scholar] [CrossRef]
  5. Grabner, H.; Nguyen, T.T.; Gruber, B.; Bischof, H. On-line boosting-based car detection from aerial images. ISPRS J. Photogramm. Remote Sens. (P&RS) 2008, 63, 382–396. [Google Scholar]
  6. Keuper, M.; Tang, S.; Andres, B.; Brox, T.; Schiele, B. Motion segmentation & multiple object tracking by correlation co-clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 140–153. [Google Scholar]
  7. Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 1725–1732. [Google Scholar]
  8. Lu, J.; Yang, J.; Batra, D.; Parikh, D. Neural baby talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  9. Li, J.; Liang, X.; Li, J.; Wei, Y.; Xu, T.; Feng, J.; Yan, S. Multistage object detection with group recursive learning. IEEE Trans. Multimed. 2017, 20, 1645–1655. [Google Scholar] [CrossRef]
  10. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
  11. Yuan, Y.; Zhang, Y. OLCN: An optimized low coupling network for small objects detection. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  12. Liu, E.; Zheng, Y.; Pan, B.; Xu, X.; Shi, Z. DCL-Net: Augmenting the Capability of Classification and Localization for Remote Sensing Object Detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 7933–7944. [Google Scholar] [CrossRef]
  13. Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Kerkyra, Greece, 20–27 September 1999; Volume 2, pp. 1150–1157. [Google Scholar]
  14. Sedaghat, A.; Mokhtarzade, M.; Ebadi, H. Uniform robust scale-invariant feature matching for optical remote sensing images. IEEE Trans. Geosci. Remote Sens. (TGRS) 2011, 49, 4516–4527. [Google Scholar] [CrossRef]
  15. Li, Q.; Wang, G.; Liu, J.; Chen, S. Robust scale-invariant feature matching for remote sensing image registration. IEEE Geosci. Remote Sens. Lett. (GRSL) 2009, 6, 287–291. [Google Scholar]
  16. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar]
  17. Tuermer, S.; Kurz, F.; Reinartz, P.; Stilla, U. Airborne vehicle detection in dense urban areas using HoG features and disparity maps. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. (J-STARS) 2013, 6, 2327–2337. [Google Scholar] [CrossRef]
  18. Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogramm. Remote Sens. (P&RS) 2014, 98, 119–132. [Google Scholar]
Figure 1. Difference among deep learning object detection frameworks. (a) Faster R-CNN represents the mainstream of deep learning object detection. (b) Cascade R-CNN represents multi-stage object detection. (c) CMD Net is the method proposed in this article.
Figure 2. Illustration of the CMD framework. The ellipses represent different parts of the framework. The ellipses named “Robust” and “Efficient” represent the (proposed) robust network head and the (traditional) efficient network head, respectively.
Figure 3. Contrast between the traditional efficient network head and the proposed robust network head. (a) The traditional efficient network head, whose regressors are selected by the classification prediction. (b) The proposed robust network head, whose single regressor is totally independent of the classifier and is general for different categories. In this way, the robust network head adopts a novel coupling method and avoids the coupling error from the classifier to the regressor.
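To make the contrast in Figure 3 concrete, the following PyTorch-style sketch shows the two coupling methods. It is a minimal illustration: the class names, feature dimension, and category count are assumptions for the example, not the authors' implementation.

```python
import torch
import torch.nn as nn

class EfficientHead(nn.Module):
    """Traditional head (Figure 3a): one regressor per category, selected by the classifier."""
    def __init__(self, in_dim=1024, num_classes=13):
        super().__init__()
        self.cls_score = nn.Linear(in_dim, num_classes + 1)   # +1 for background
        self.bbox_pred = nn.Linear(in_dim, 4 * num_classes)   # class-specific box deltas

    def forward(self, roi_feat):
        scores = self.cls_score(roi_feat)                      # (N, C+1)
        num_classes = scores.shape[1] - 1
        deltas = self.bbox_pred(roi_feat).view(-1, num_classes, 4)
        cls = scores[:, :-1].argmax(dim=1)                     # predicted foreground class
        # The regression output is coupled to the classification result:
        chosen = deltas[torch.arange(deltas.size(0)), cls]     # (N, 4)
        return scores, chosen

class RobustHead(nn.Module):
    """Proposed head (Figure 3b): a single class-agnostic regressor, decoupled from the classifier."""
    def __init__(self, in_dim=1024, num_classes=13):
        super().__init__()
        self.cls_score = nn.Linear(in_dim, num_classes + 1)
        self.bbox_pred = nn.Linear(in_dim, 4)                  # one regressor shared by all classes

    def forward(self, roi_feat):
        scores = self.cls_score(roi_feat)
        deltas = self.bbox_pred(roi_feat)                      # (N, 4), independent of the scores
        return scores, deltas
```

In the efficient head, a misclassified proposal also receives the wrong category's box deltas, which is the coupling error described in the caption; the robust head's single regressor is unaffected by classification mistakes.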
Figure 4. Illustration of the CMD framework. (a) The traditional multi-stage detection framework. (b) The consistent multi-stage detection framework CMD. The ellipses represent different parts of the framework. The ellipses named “Robust” and “Efficient” represent the (proposed) robust network head and the (traditional) efficient network head, respectively.
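The traditional multi-stage pipeline of Figure 4a can be summarized by a generic refinement loop, sketched below. This is only an illustration of cascaded heads under assumed placeholder callables (`roi_pool`, `decode`, `heads`); it does not reproduce the proposal-consistent cooperation between the robust and efficient heads shown in Figure 4b.

```python
def cascade_detect(features, proposals, heads, roi_pool, decode):
    """Run each network head in turn, feeding the refined boxes of one
    stage into the RoI pooling of the next stage (illustrative sketch)."""
    boxes = proposals
    scores = None
    for head in heads:
        roi_feat = roi_pool(features, boxes)   # sample features inside the current boxes
        scores, deltas = head(roi_feat)        # classify and regress
        boxes = decode(boxes, deltas)          # refine the boxes for the next stage
    return boxes, scores
```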
Figure 5. Examples of the DIOR data set.
Figure 6. Examples of the HRRSD data set.
Figure 7. Examples of the NWPU VHR-10 data set.
Table 1. Confusion matrix.
                     Label Positive    Label Negative
Predicted Positive   TP                FP
Predicted Negative   FN                TN
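From the entries of Table 1, precision and recall follow the standard definitions. A minimal sketch (variable names are illustrative, not tied to the authors' evaluation code):

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0   # fraction of predictions that are correct
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0      # fraction of labeled objects that are found
    return precision, recall

# Example: 90 true positives, 10 false positives, 30 missed objects
print(precision_recall(90, 10, 30))   # (0.9, 0.75)
```

Average precision summarizes the precision-recall curve of one class, and mAP (reported in Table 5) is its mean over all classes.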
Table 5. Ablation Experiments on HRRSD (%).
Exp.  Model                          Sp     Bg     GTF    ST     BC     TC     Ar
1     FasterR-CNN_r50+FPN+CMD-RFF    91.6   88.3   97.7   95.5   69.3   93.7   98.5
2     w/o RFF schedule               92.2   86.4   97.5   95.3   68.8   93.3   98.6
3     w/o stage3                     89.1   87.4   97.0   93.2   67.3   92.2   97.8
4     w/o stage2&3                   87.2   85.4   96.4   93.9   64.8   91.9   97.6

Exp.  Model                          BD     Hb     Vc     CR     TJ     PL     mAP
1     FasterR-CNN_r50+FPN+CMD-RFF    91.0   95.1   95.0   93.2   78.1   66.8   88.7
2     w/o RFF schedule               90.7   94.4   95.7   92.0   75.5   65.5   88.1
3     w/o stage3                     91.9   95.1   91.1   92.0   76.8   63.6   87.3
4     w/o stage2&3                   91.7   93.0   90.7   90.8   75.0   60.4   86.1

Best performance of each class is in bold. Sp: ship, Bg: bridge, GTF: ground track field, ST: storage tank, BC: basketball court, TC: tennis court, Ar: airplane, BD: baseball diamond, Hb: harbor, Vc: vehicle, CR: cross-road, TJ: T-junction, PL: parking lot.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
