1. Introduction
With the rapid development of modern remote sensing technology, remote sensing image processing has been widely applied in various fields, such as object detection [1,2,3], road mapping [4,5,6], agricultural planning [7,8,9], semantic analysis [10,11,12] and urban planning [13,14,15]. Recently, thanks to the strong feature representation ability of Convolutional Neural Networks (CNNs) [16,17] and the availability of rich datasets with instance-level annotations, object detection in remote sensing images has achieved breakthrough performance. However, collecting such accurate annotations is labor-intensive and time-consuming. In contrast, scene-level tags can be obtained easily. This paper therefore studies the use of weakly supervised scene-level tags to realize semi-autonomous learning of object detection in remote sensing images.
In existing remote sensing image object detection methods, the data used to train the network usually carry object-level annotations, which specify the number, location, size and orientation of all objects in the scene. A scene-level label, in contrast, only records the category of the main objects in each scene and contains no specific object information. In addition, scenes with the same label usually contain different numbers of objects with different positions, sizes and orientations. The absence of this object information poses a great challenge to the learning process.
With the aid of multiple instance learning, researchers [16,18,19,20,21,22,23,24,25,26,27] have shown that deep networks trained with only image-level/scene-level tags can generate pseudo ground truth and thus effectively predict object locations. Most semi-autonomous learning object detection methods are divided into two stages [22,28,29,30]. In the first stage, a series of candidate boxes is generated by an object proposal method [31,32]; in the second stage, classifying the features extracted from each proposal is treated as an instance-level classification problem. The core of these methods is to treat each image as a bag of potential object instances and then train an instance classifier under the constraints of multiple instance learning. Promising results have been reported with these methods. In [23], Ren et al. proposed an unpacking algorithm that iteratively removes negative instances from positive bags, which effectively reduces the ambiguity of positive bags. However, the method ignores the accuracy of positive bags and may produce less accurate ones. Bilen et al. [22] proposed an effective multiple instance learning method, the weakly supervised deep detection network (WSDDN), which operates at the image-region level and performs region selection and classification at the same time. Nevertheless, this method lacks explicit labels for classifying instances. Later, Tang et al. [28] proposed an online instance classifier refinement (OICR) algorithm, which integrates multiple instance learning and the instance classifier refinement process into a single deep network and refines the instance classifier online using spatial relationships. This method learns more discriminative instance classifiers by explicitly assigning binary labels. However, it makes the network pay more attention to parts of objects and ignore the complete objects. Thereafter, proposal cluster learning (PCL) was presented in [29] to refine the instance classifier by learning from generated proposal clusters. As the proposals in the same cluster are associated with the same object, this makes the network focus more on whole objects instead of object parts. In [30], Feng et al. designed a dual contextual refinement strategy that shifts the focus of the detection network from locally discriminative parts to the whole object by using local and global context information.
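The bag-of-instances formulation shared by these methods can be illustrated with a minimal sketch. The function names and the max-pooling aggregation below are illustrative of the common WSDDN-style MIL constraint (a bag is positive if at least one instance is positive), not the exact formulation of any cited method.

```python
import math

def bag_score(instance_scores):
    """Aggregate per-proposal class scores into one image-level score.

    Under the MIL constraint, a bag is positive if at least one of its
    instances is positive, so max-pooling over instances is a common choice.
    """
    return max(instance_scores)

def bag_loss(instance_scores, bag_label):
    """Binary cross-entropy between the aggregated bag score and the
    scene-level label (1 if the class is present in the scene, else 0)."""
    p = min(max(bag_score(instance_scores), 1e-7), 1 - 1e-7)
    return -(bag_label * math.log(p) + (1 - bag_label) * math.log(1 - p))
```

Only the scene-level label supervises the loss; which instance actually triggered the positive bag score remains latent, which is exactly the ambiguity the refinement methods above try to resolve.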
Although these studies have achieved good results, the development of semi-autonomous learning in remote sensing images still faces two major challenges. The first is that deep residual networks such as ResNet [33] and DenseNet [34] have become the standard backbones of many computer vision tasks, while advanced semi-autonomous learning methods for object detection still rely on plain networks such as VGG [35]. The underlying problem is that the head of a semi-autonomous learning network is unstable and particularly sensitive to model initialization. This may propagate uncertain and erroneous gradients back to the backbone, thus deteriorating visual representation learning. The problem is especially severe in remote sensing images, where the background is complex, the features are cluttered and the objects are small. The large convolution kernels and non-maximum down-sampling in a traditional ResNet lead to the loss of highly informative features from the original image [36].
The second challenge is that most semi-autonomous learning methods tend to select only the highest-scoring candidate boxes to train the corresponding detectors. However, the highest-scoring proposals and their associated proposals usually cover only a part of the object rather than the whole object, especially in large-scale, cluttered remote sensing images. Such methods may therefore fail to extract complete object features. In addition, remote sensing images usually contain multiple instances of the same class. If only the highest-scoring proposal and its associated proposals are selected as pseudo ground truth, other instances of the same class may be missed, resulting in suboptimal object detectors [29].
For the first challenge, we design a modified residual network (MRN) to address the loss of highly informative features from the original image. Specifically, we replace the large convolution kernels and non-maximum down-sampling of a traditional residual network with small convolution kernels and max-pooling. This enhances the robustness of the network's information flow by extracting finer object boundary features from the original image and preserving instance information. When a remote sensing image passes through a deep neural network, repeated convolution and pooling reduce the image size while enlarging the receptive field, which sacrifices resolution and makes some image details impossible to reconstruct. To solve this problem, dilated convolution is used in ResNet to enlarge the receptive field without losing feature information to the pooling operation. Each convolution output then covers a wider range of feature information than an ordinary convolution, which helps to capture the global information of object features in remote sensing images.
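The receptive-field trade-off motivating these substitutions can be checked with standard receptive-field arithmetic. The helper below is a generic sketch (not taken from the paper): the effective kernel of a dilated convolution is `d*(k-1)+1`, so a dilation-2 3×3 convolution sees as far as a 5×5 one, and a stack of small stride-1 kernels matches the reach of one large kernel without discarding resolution through striding.

```python
def receptive_field(layers):
    """Receptive field of a stack of conv/pool layers.

    Each layer is a (kernel, stride, dilation) triple. The effective kernel
    of a dilated convolution is k_eff = dilation * (kernel - 1) + 1; the
    running `jump` tracks the cumulative stride between feature-map pixels.
    """
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        k_eff = dilation * (kernel - 1) + 1
        rf += (k_eff - 1) * jump
        jump *= stride
    return rf
```

For example, three stride-1 3×3 convolutions reach the same 7-pixel receptive field as ResNet's single 7×7 stem convolution, while a dilation-2 3×3 layer widens coverage with no down-sampling at all.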
To address the second challenge, this paper proposes an end-to-end aggregation-based region-merging strategy (ARMS), which combines similar regions within a cluster and deletes redundant ones to select high-quality regions. During network refinement, this strategy shifts the focus of detection from parts of an object to the whole object and reduces the influence of background areas on the network. More specifically, instead of directly selecting the highest-scoring proposal as pseudo-supervision, we select the other regions related to the highest-scoring proposal and merge these related regions to form a complete object region, generating a new confidence score according to the merged result.
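The idea of merging the regions related to the top-scoring proposal can be sketched as follows. This is a simplified illustration, not the paper's ARMS formulation: here the association between regions is reduced to an IoU threshold, the merged region is the bounding union of the cluster, and the new confidence is the cluster's mean score; all names and thresholds are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def merge_cluster(boxes, scores, assoc_thresh=0.5):
    """Instead of keeping only the top-scoring proposal, gather the proposals
    associated with it (IoU >= assoc_thresh), take their bounding union as the
    merged region, and assign the cluster's mean score as a new confidence."""
    top = max(range(len(boxes)), key=lambda i: scores[i])
    cluster = [i for i in range(len(boxes))
               if iou(boxes[i], boxes[top]) >= assoc_thresh]
    merged = (min(boxes[i][0] for i in cluster),
              min(boxes[i][1] for i in cluster),
              max(boxes[i][2] for i in cluster),
              max(boxes[i][3] for i in cluster))
    confidence = sum(scores[i] for i in cluster) / len(cluster)
    return merged, confidence
```

Because the merged box covers the whole cluster rather than a single peak response, the pseudo ground truth is less likely to lock onto a discriminative object part.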
In this paper, the MRN and ARMS are combined to form a progressive aggregation-area instance refinement network. The MRN solves the problem of losing highly informative features from the original image: it extracts finer object boundary features and preserves instance information. The ARMS is designed to select high-quality instances by choosing aggregation areas and merging them: it selects the aggregation areas that are highly related to the object information through an association coefficient, then evaluates these areas through a similarity coefficient and fuses them to obtain high-quality object instance regions. Moreover, a regression-based locating branch is further developed to refine object locations, which can be optimized jointly with region classification. Experiments on the augmented NWPU VHR-10 and LEVIR datasets clearly demonstrate that the proposed method significantly outperforms previous state-of-the-art semi-autonomous learning object detection approaches. In summary, the main contributions of this paper are as follows: (1) An end-to-end framework is proposed to realize semi-autonomous learning object detection from image-level annotations, which jointly optimizes region classification and regression to improve the accuracy of object localization. (2) The MRN is applied as the backbone to implement a lightweight network structure, which provides finer object boundaries and preserves the information of small instances, enhancing the robustness of information flow in the network. (3) A novel ARMS algorithm is proposed to exploit the cluster information of regions to obtain high-quality object features and further suppress the interference of cluttered background information.