Article

Weakly Supervised Object Detection for Remote Sensing Images via Progressive Image-Level and Instance-Level Feature Refinement

1 School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
2 School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(7), 1203; https://doi.org/10.3390/rs16071203
Submission received: 17 January 2024 / Revised: 14 March 2024 / Accepted: 27 March 2024 / Published: 29 March 2024

Abstract:
Weakly supervised object detection (WSOD) aims to predict a set of bounding boxes and corresponding category labels for instances with only image-level supervision. Compared with fully supervised object detection, WSOD in remote sensing images (RSIs) is much more challenging due to the vast foreground-related context regions. In this paper, we propose a progressive image-level and instance-level feature refinement network to address the problems of missing detection and part domination for WSOD in RSIs. First, we propose a multi-label attention mining loss (MAML)-guided image-level feature refinement branch to effectively allocate the computational resources towards the most informative parts of images. With the supervision of MAML, all latent instances in images are emphasized. However, image-level feature refinement further widens the response gaps between the most informative parts and other sub-optimal informative ones, which exacerbates the problem of part domination. To alleviate this limitation, we further construct an instance-level feature refinement branch to re-balance the contributions of different adjacent candidate bounding boxes according to the detection task. An instance selection loss (ISL) is proposed to progressively boost the representation of salient regions by exploring supervision from the network itself. Finally, we integrate the image-level and instance-level feature refinement branches into a complete network, and the proposed MAML and ISL functions are merged with class classification and box regression to optimize the whole WSOD network in an end-to-end manner. We conduct experiments on two popular WSOD datasets, NWPU VHR-10.v2 and DIOR. All the experimental results demonstrate that our method achieves competitive performance compared with other state-of-the-art approaches.

1. Introduction

Convolutional neural networks (CNNs) have facilitated progress in various computer vision tasks [1,2,3,4,5,6], including object detection [7,8,9,10] in remote sensing images (RSIs). However, accurate bounding box (BB)-level annotations [11,12,13] are difficult to obtain. Due to the lack of professional knowledge, manually annotating image- and BB-level labels can be labor-intensive and imprecise. In this paper, we focus on weakly supervised learning (WSL) for object detection.
Most previous weakly supervised object detection (WSOD) methods [14,15,16,17] treated WSL as a multiple instance learning (MIL) problem. First, images are defined as bags, and different region proposal methods [18,19] are utilized to generate candidate regions. The weakly supervised deep detection network (WSDDN) was proposed in [14], which simultaneously performed region selection and category classification. Tang et al. [15] proposed multiple online instance classifier refinement (OICR) to propagate category label inference for refining the object classifier. The OICR selected the top-scoring proposal and assigned the same label to its spatially highly overlapped regions. Tang et al. [20] also designed a proposal cluster learning (PCL) network to boost the performance of the base OICR. The PCL network divided each instance bag into several small new bags and reduced the ambiguous supervision for these bags.
According to [21,22,23,24,25,26,27,28], WSOD in RSIs still encounters two major challenges. First, most previous methods tend to focus on the most discriminative regions in an image (part domination). For example, airfoil and fuselage regions might contribute the most to recognizing airplane instances. However, using an airfoil or fuselage part to represent the entire airplane instance is obviously problematic. This problem is further aggravated in RSIs due to the vast foreground-related context regions. Second, there are always multiple instances of the same category in RSIs. Most WSOD methods use the highest-scoring proposals as pseudo-supervision to optimize the detection model, which results in omitting many valuable positive proposals (missing detection).
Due to the lack of BB annotations, WSOD methods have no proper guidance and can only rely on an ill-posed training procedure. As an effective way to provide reliable priors, an attention mechanism (AM) can be used as a component to allocate the computational resources towards the most informative parts of images, so that all the useful information is highlighted. However, as illustrated in the second row of Figure 1, image-level features refined by the attention mechanism yield the ‘airfoil’ region as the detection prediction for the class ‘airplane’, resulting in inferior WSOD performance (part domination). Meanwhile, the uncontrollable nature of the attention mechanism can further exacerbate this problem. In contrast, exploring spatial diversification constraints is an effective way to alleviate the problem of part domination. As illustrated in the third row of Figure 1, instance-level features refined by the spatial diversification constraints have great potential for excavating the complete instance according to instance-associative learning.
In this paper, we propose a progressive image-level and instance-level feature refinement network to perceive more latent positive proposals and excavate complete instances for WSOD in RSIs. First, we propose an image-level feature refinement branch, which involves an attention mechanism to excavate all the informative regions in features. A novel multi-label attention mining loss (MAML) is constructed to train the attention mask as a category-specific descriptor. For the instance-level feature refinement branch, we propose a salient RoI feature selection strategy that picks out all crucial regions together with their spatially highly overlapped proposals to model the positional correlation among different instance-level features. An instance selection loss (ISL) is proposed to further boost the representation of the selected salient regions by exploring supervision from the network itself. Finally, we jointly optimize region classification and box regression with the constructed MAML and ISL loss functions. Our contributions can be summarized as follows:
  • We propose a progressive image-level and instance-level feature refinement network to perceive more latent positive proposals and explore the instance-associative spatial correlation among instance regions.
  • A multi-label attention mining loss (MAML) and an instance selection loss (ISL) are constructed to boost the representation of image-level and instance-level features by exploring supervision from the network itself.
  • We optimize the classification and regression tasks by the constructed MAML and ISL loss functions to boost the performance of WSOD in RSIs. The proposed method outperforms previous state-of-the-art approaches on the NWPU VHR-10.v2 and DIOR datasets, which demonstrates the effectiveness of boosting deep features at both image level and instance level for WSOD.
The organization of this paper is as follows: In Section 2, we introduce the studies related to our work. Then, in Section 3, we introduce the details of our method. In Section 4, we compare our approaches with other WSOD methods on the NWPU VHR-10.v2 and DIOR datasets. Finally, the conclusions on our proposed method are given in Section 5.

2. Related Work

2.1. Weakly Supervised Object Detection

Multiple instance learning (MIL) [29] is the most frequently used formulation in WSOD tasks. For example, the WSDDN proposed in [14] was the first convolutional neural network-based WSOD method to perform region classification and selection simultaneously. With only image-level supervision, Tang et al. [15] combined MIL and a progressively refined instance classifier into a single detection network to construct the online instance classifier refinement (OICR) algorithm. The proposed OICR constructed multiple streams in a deep network, where each branch is supervised by the outputs of the preceding one. Based on this, a host of follow-up studies [30,31] were developed.
Although WSOD has made significant progress in natural scenes, it is still nontrivial to directly apply these approaches to RSIs. According to [32,33,34,35], WSOD in RSIs still suffers from two major challenges: part domination and missing detection. Qian et al. [36] constructed a part-based topology distillation network (PTDNet) to perceive combined instances via the extracted most informative parts of objects. Layer-wise relevance propagation (LRP) and point set representation (RepPoints) [37] were proposed to reduce the ambiguities in object recognition. An improved oriented loss function (IOLF) and pseudo-oriented bounding boxes were used as supervision to train the aforementioned modules. Qian et al. [38] proposed pseudo-instance soft labels (PSL) to evaluate whether each proposal covers a complete instance. Feng et al. [39] proposed a rotation-invariant aerial object detection network (RINet), which constructed a flexible multi-branch online detector refinement to perceive more rotated instances in RSIs. After coupling instance predictions with their different rotated variants, rotation-consistent supervision was proposed to improve the performance of WSOD.

2.2. Feature Refinement for WSOD in RSIs

2.2.1. Image-Level Feature Refinement

In order to model complementary and discriminative visual patterns, a triple context-aware network (TCANet) [24] was proposed to capture the global visual context of image-level features. Moreover, the semantic discrepancy of the local context was leveraged by the constructed dual-local context residual (DLCR) module. Both the global and local image-level contexts were used to boost the performance of WSOD. Ma et al. [40] re-labeled an optical RSI dataset in a scribble manner and proposed a scribble embedding network (SEN) to extract discriminant regions in images. SEN unified correlation filtering technology into the proposed WSOD network, which successfully emphasizes foreground features and suppresses cluttered background. Shamsolmoali et al. [41] observed that most existing WSOD methods struggle to detect small objects in RSIs. Their multi-patch feature pyramid network (MPFP-Net) first divided images into class-affiliated subsets, and an effective regularization strategy was constructed to perform a norm-preserving fusion transition. Therefore, image-level feature refinement has great potential for addressing the problem of missing detection in RSIs. However, re-balancing the computational resources towards all the informative parts of images inevitably exacerbates the problem of part domination. Consequently, many researchers have devoted themselves to modeling the interaction among instance-level features for perceiving complete instances.

2.2.2. Instance-Level Feature Refinement

In [21], a multiple instance graph (MIG) learning framework was proposed with a spatial graph-based vote (SGV) mechanism to collect the highest-scoring proposals together with their highly spatially overlapped regions as positive objects. Moreover, latent instances of the same category were excavated via the constructed appearance graph-based instance mining (AGIM) module. Xie et al. [42] introduced an attention erasing (AE) strategy into the WSOD model to drop the most discriminative region. An IoU-balanced sampling component was constructed to excavate complete instances, which further enhanced the performance of WSOD. Cheng et al. [23] proposed a self-guided proposal generation (SPG) module to excavate more reliable candidate proposals, which explicitly generated more high-quality instance-level features. Tan et al. [37] unified horizontal and oriented object detection tasks into a complete WSOD framework to detect differently oriented instances in RSIs. Feng et al. [43] proposed an end-to-end progressive contextual instance refinement (PCIR) framework, which constructed a dual-contextual instance refinement (DCIR) strategy to re-balance the focus of the WSOD network from the local discriminative part to the complete instance. The complementary multiple instance detection network (CMIDN) with a complementary feature learning (CEL) module [44] was proposed to excavate more latent objects, where the contribution of each proposal was weighted in the final loss function.

3. The Proposed Method

The overview of our proposed method is illustrated in Figure 2. First, a VGG16 backbone network, pre-trained on a large-scale image-level classification dataset [11], is used to extract coarse image-level features $F$. Then, an image-level feature refinement branch is constructed to excavate all the informative regions in the image. In order to guide the learning of the attention weights, we propose a multi-label attention mining loss (MAML) function to enhance the localization ability of feature refinement under the supervision of image-level annotations. After feeding the enhanced image-level features $F_{AM}$ into the RoI Pooling layer, we obtain the original instance-level RoI features $R$. All the RoI features are fed into our instance-level feature refinement branch. Then, a set of salient RoIs $R^{AB}$ can be obtained via the proposed salient RoI feature selection (SRFS) strategy. Because $R^{AB}$ are selected by the category score vectors and the ground-truth (GT) image-level labels, these instance-level RoI features are more likely to contain the informative regions of certain categories. Therefore, we further construct an instance selection loss (ISL) function to provide an explicit indicator to guide the process of instance-level feature selection. Finally, the enhanced instance-level RoI features $R^e$ are fed into a MIL branch to perform pseudo-location regression and proposal classification. The algorithmic pipeline of our method is formulated in Algorithm 1.
Algorithm 1 Pipeline of image-level and instance-level feature refinement.
Require: Coarse backbone features $F$, image-level labels $L$, score threshold $T_s$, IoU threshold $T_i$, number of salient RoIs $T_n$
Ensure: Enhanced instance-level features $R^e$, MAML loss $\mathcal{L}_{MAML}$, ISL loss $\mathcal{L}_{ISL}$
1:  Generate attention map $W$ by Equations (1) and (2)
2:  Calculate MAML loss $\mathcal{L}_{MAML}$ by Equations (4) and (5)
3:  Generate enhanced image-level features $F^e$ by Equation (3)
4:  Generate instance-level features $R$ by the RoI Pooling layer
5:  Calculate proposal scores $R^s$ by Equations (6)–(8)
6:  if image-level labels $L$ contain category $i$ then
7:      Sort scores $R^s$ by the confidence of category $i$
8:      for $r$ in $R$ do
9:          if sorted score $r^s \geq T_s$ then
10:             Put $r$ in $R^A$
11: for $r$ in $(R - R^A)$ do
12:     Calculate the IoU between $R^A$ and $(R - R^A)$
13:     if IoU $\geq T_i$ then
14:         Put $r$ in $R^B$
15: Calculate the number of RoIs in $R^A$ and $R^B$ as $M = A + B$
16: while $M < T_n$ do
17:     for $r$ in $(R - R^A - R^B)$ do
18:         Randomly put $r$ in $R^C$
19:         $M = M + 1$
20: Generate enhanced instance-level features $R^e$ by Equations (9) and (10)
21: Calculate ISL loss $\mathcal{L}_{ISL}$ by Equation (11)

3.1. Motivations

Due to the complex background information and the vast foreground-related context regions, instances with small scales are easily submerged by useless regions, resulting in the problem of missing detection. Thus, our proposed image-level feature refinement branch aims to excavate all the informative regions, which has great potential for perceiving all the target instances in images. However, allocating the computational resources towards the most informative parts of objects inevitably exacerbates the problem of part domination. For example, airfoil and fuselage regions might contribute the most to detecting airplane instances. These regions can be further emphasized by the constructed image-level feature refinement branch. However, using an airfoil or fuselage to represent the entire airplane instance is obviously problematic. Therefore, we construct an instance-level feature refinement branch to explore instance-associative spatial diversification constraints. As illustrated in the second and third rows of Figure 1, the image-level feature refinement branch successfully detects the small airplane while regarding a partial region of the harbor as the final detection result. Conversely, our proposed instance-level feature refinement branch recognizes the complete harbor instance while missing the small airplane. The aforementioned analysis inclines us to unify the image-level and instance-level feature refinement branches into a complete network to boost the performance of WSOD in RSIs.

3.2. Image-Level Feature Refinement Branch

Given an intermediate feature map $F \in \mathbb{R}^{C \times W \times H}$ as input, the attention mechanism is adopted to determine a spatially normalized attention weight map to enhance the representation of $F$. Here, we follow the conventional pipeline to refine the image-level features.
The global max pooling (GMP) layer focuses on the area with the highest response value and reduces the impact of unrelated information, which aligns with the working principle of our proposed image-level feature refinement branch. Moreover, channel-wise global average pooling (GAP) aggregates multiple representations of each pixel and encourages the network to distinguish instances of different categories. Therefore, we utilize a combination of GAP and GMP to generate the spatial context descriptor as
$F_{AM} = \left[ \frac{1}{c} \sum_{i=1}^{c} F_p^i,\ \max(F_p^i)\big|_{i=1}^{c} \right], \quad (1)$
where $F_{AM} \in \mathbb{R}^{2 \times W \times H}$; $p$ and $c$ denote the pixel position and channel dimension of $F$, respectively; and $[\,\cdot\,]$ represents the concatenation operation. These two features are fused to gather the contextual information of instances. We then use a convolutional layer $\mathcal{C}$ to boost $F_{AM}$ as
$W = \sigma(\mathcal{C}(F_{AM})), \quad (2)$
where $W \in \mathbb{R}^{1 \times W \times H}$ denotes the spatial attention mask and $\sigma$ represents the Sigmoid function. The final enhanced feature $F^e$ can be formulated as
$F^e = W \odot F, \quad (3)$
where $\odot$ denotes the element-wise product.
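To make the refinement concrete, the following PyTorch-style sketch implements Equations (1)–(3). It is a minimal illustration under our own assumptions (e.g., the convolution kernel size, which is not specified above), not the authors' released implementation:

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Minimal sketch of the image-level refinement in Equations (1)-(3)."""
    def __init__(self, kernel_size=7):  # kernel size is an assumption
        super().__init__()
        # 2 input channels: the channel-wise GAP and GMP maps of Equation (1)
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, feat):
        avg = feat.mean(dim=1, keepdim=True)       # channel-wise GAP, (B, 1, W, H)
        mx = feat.max(dim=1, keepdim=True).values  # channel-wise GMP, (B, 1, W, H)
        f_am = torch.cat([avg, mx], dim=1)         # Equation (1): concatenation
        w = torch.sigmoid(self.conv(f_am))         # Equation (2): attention mask W
        return w * feat, w                         # Equation (3): F^e = W ⊙ F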
Importantly, the attention mechanism is an effective way to re-balance the importance of different regions in an image so that all the foreground instances can be perceived. However, supervised by only image-level annotations, the attention mechanism often confuses instances with similar appearances. To address this limitation, we propose a multi-label attention mining loss (MAML) function to excavate latent supervision from the network itself.
Since we attain the attention mask $W$ in Equation (2), we feed $W$ into a convolutional layer $\mathcal{C}$, and the predicted vector $\hat{y}$ is determined by a GAP layer $\mathcal{G}$. The aforementioned operations can be formulated as
$\hat{y} = \mathcal{G}(\mathcal{C}(W)), \quad (4)$
Finally, the predicted score vector $\hat{y}$ can be supervised by the MAML loss as
$\mathcal{L}_{MAML} = -\sum_{c=1}^{C} \left( y_c \log \hat{y}_c + (1 - y_c) \log(1 - \hat{y}_c) \right), \quad (5)$
where $y_c$ denotes the ground truth of the image-level classification annotations. The constructed MAML supervision thus provides an explicit descriptor for the detection network to estimate whether an image contains an instance of a certain category. The final attention-refined feature map $F^e$ is fed into the subsequent network for instance-level feature refinement.
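As a companion sketch of Equations (4)–(5), again with assumed names, we treat the GAP output as logits and use the numerically stable BCE-with-logits form of the multi-label loss:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MAMLHead(nn.Module):
    """Sketch of Equations (4)-(5): supervise the attention mask with image-level labels."""
    def __init__(self, num_classes):
        super().__init__()
        self.cls_conv = nn.Conv2d(1, num_classes, kernel_size=1)  # layer C in Eq. (4)

    def forward(self, w, y):
        # w: attention mask (B, 1, W, H); y: multi-hot image labels (B, num_classes)
        logits = self.cls_conv(w).mean(dim=(2, 3))  # GAP layer G -> (B, num_classes)
        return F.binary_cross_entropy_with_logits(logits, y)  # Equation (5)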

3.3. Instance-Level Feature Refinement Branch

Feeding $F^e$ with its candidate proposals into the RoI Pooling layer [45], the fixed-size instance-level features $R \in \mathbb{R}^{N \times D}$ are normalized into the same scale, where $N$ and $D$ represent the number and dimension of $R$, respectively. Commonly, previous WSOD methods sort these instance-level features by the predicted score vectors and use the highest scorer to optimize their detection networks. This process is problematic in WSOD because the candidate proposals contain a large number of negative regions and only a small number of positive proposals. Obviously, the positive proposals contribute more to detecting objects in RSIs. To alleviate the aforementioned limitation, we propose a salient RoI feature selection (SRFS) strategy that picks out all crucial regions together with their spatially highly overlapped proposals to model the positional correlation among different instance-level features.
Given a set of instance-level features $R = [r_1, r_2, \ldots, r_N]$, two fully connected layers are used to produce two matrices $R^c, R^d \in \mathbb{R}^{N \times C}$, where $C$ denotes the number of categories. These two matrices are normalized by two Softmax layers $\rho$ as
$[\rho(R^c)]_{ij} = \frac{e^{R^c_{ij}}}{\sum_{k=1}^{C} e^{R^c_{kj}}}, \quad (6)$
$[\rho(R^d)]_{ij} = \frac{e^{R^d_{ij}}}{\sum_{k=1}^{N} e^{R^d_{ik}}}, \quad (7)$
where $\rho(R^c)$ and $\rho(R^d)$ denote the classification and detection predictions, respectively. The proposal scores $R^s$ are generated as
$R^s = \rho(R^c) \odot \rho(R^d), \quad (8)$
where $\odot$ denotes the element-wise product.
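In code, this two-stream scoring (the WSDDN-style formulation the branch builds on) reduces to two softmaxes over different axes; a brief sketch with assumed tensor shapes:

import torch

def proposal_scores(r_c, r_d):
    """Sketch of Equations (6)-(8); r_c and r_d have shape (N, C)."""
    rho_c = torch.softmax(r_c, dim=1)  # Equation (6): softmax over categories
    rho_d = torch.softmax(r_d, dim=0)  # Equation (7): softmax over proposals
    return rho_c * rho_d               # Equation (8): element-wise product R^s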

3.3.1. Salient RoI Feature Selection

Since we attain the instance-level features $R \in \mathbb{R}^{N \times D}$ and the corresponding proposal scores $R^s \in \mathbb{R}^{N \times C}$, the top-ranking proposal together with its highly spatially overlapped regions can roughly determine the localization of objects, because these proposals cover the most discriminative regions in images. As illustrated in Figure 3, the image-level annotation is ‘airplane’. After transferring the image-level label into a one-hot vector, we obtain the transferred annotation $y_c = [1, 0, 0, \ldots, 0]$, where index 0 of $y_c$ denotes the corresponding category. Given the predicted scores $R^s$, we select the corresponding row (index 0) of the category score vectors and put the RoI features with a category score higher than a threshold $T_s$ (0.5 in our experiments) into a subset of salient RoI features, named $R^A \in \mathbb{R}^{A \times D}$. Additionally, regions that are highly overlapped with $R^A$ may belong to the same category. Therefore, we calculate the IoU between $R^A$ and all other RoIs. When the IoU is larger than a threshold $T_i$ (0.5 in our experiments), we put these RoI features into a subset of salient RoI features, named $R^B \in \mathbb{R}^{B \times D}$. Finally, we randomly pick RoI features from the remaining set $(R - R^{AB})$ to attain $R^C \in \mathbb{R}^{C \times D}$, until the number of selected RoIs is equal to the defined threshold $T_n$ (300 in our model).
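A minimal sketch of the SRFS strategy follows, assuming proposal boxes are available alongside the scores (function and variable names are ours; the thresholds follow the values above):

import torch
from torchvision.ops import box_iou

def select_salient_rois(scores, boxes, label_idx, t_s=0.5, t_i=0.5, t_n=300):
    """Sketch of SRFS: scores (N, C), boxes (N, 4), label_idx a GT category index.
    Returns indices of the selected salient RoIs (R^A, R^B, then random R^C)."""
    cls_scores = scores[:, label_idx]
    idx_a = torch.nonzero(cls_scores >= t_s).flatten()  # subset R^A
    overlapped = (box_iou(boxes, boxes[idx_a]) >= t_i).any(dim=1)
    overlapped[idx_a] = False                           # exclude R^A itself
    idx_b = torch.nonzero(overlapped).flatten()         # subset R^B
    selected = torch.cat([idx_a, idx_b])
    if selected.numel() < t_n:                          # fill with random RoIs (R^C)
        mask = torch.ones(scores.size(0), dtype=torch.bool)
        mask[selected] = False
        rest = torch.nonzero(mask).flatten()
        extra = rest[torch.randperm(rest.numel())][: t_n - selected.numel()]
        selected = torch.cat([selected, extra])
    return selected[:t_n]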
After obtaining the salient instance-level features $R^M \in \mathbb{R}^{M \times D}$, where $R^M$ is the union of $R^A$, $R^B$ and $R^C$, we adopt the self-attention mechanism [46] to model the correlation among these instance-level features:
$w = \rho\left( \frac{W_{query}(R) \cdot W_{key}(R^M)^T}{\sqrt{d_{key}}} \right), \quad (9)$
$R^e = w \cdot W_{value}(R^M), \quad (10)$
where $M$ denotes the number of selected instance-level features; $w \in \mathbb{R}^{N \times M}$ represents the similarity weights; and $W_{query}$, $W_{key}$ and $W_{value}$ denote different Linear layers. The final enhanced instance-level features $R^e$ are fed into the MIL branch for box regression and category classification.
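A sketch of this cross-attention between all RoI features and the selected salient set, under the standard scaled dot-product formulation of [46] (class and parameter names are assumptions):

import math
import torch
import torch.nn as nn

class SalientRoIAttention(nn.Module):
    """Sketch of Equations (9)-(10): every RoI attends to the salient RoIs."""
    def __init__(self, dim):
        super().__init__()
        self.w_query = nn.Linear(dim, dim)
        self.w_key = nn.Linear(dim, dim)
        self.w_value = nn.Linear(dim, dim)

    def forward(self, r, r_m):
        # r: (N, D) all instance-level features; r_m: (M, D) salient features R^M
        q, k, v = self.w_query(r), self.w_key(r_m), self.w_value(r_m)
        w = torch.softmax(q @ k.T / math.sqrt(k.size(-1)), dim=-1)  # Equation (9)
        return w @ v                                                # Equation (10)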

3.3.2. Instance Selection Loss

We construct an instance selection loss (ISL) function to progressively boost the representation of the selected RoI features. All the selected salient proposals (except $R^C$) are more likely to contain the key region of a certain category, while the rest should not trigger the network to recognize the object. Therefore, $R^{AB}$ can be supervised by the loss of [14]. The scores of these proposals are generated by the element-wise product $s^{AB} = \rho(W_4(R^{AB})) \odot \rho(W_5(R^{AB}))$, where $W_4$ and $W_5$ denote different linear functions. The $c$-th class score can be obtained by summing the score vectors over all proposals, $\hat{y}_c^{AB} = \sum_{k=1}^{AB} s_c^k$. Our selection loss is formulated as
$\mathcal{L}_{ISL} = -\sum_{c=1}^{C} \left( y_c \log \hat{y}_c^{AB} + (1 - y_c) \log(1 - \hat{y}_c^{AB}) \right), \quad (11)$
where $y_c$ is the ground truth of the image-level classification annotations. With the supervision of $\mathcal{L}_{ISL}$, the network learns to model the inter-dependencies between the most salient instance-level RoI features and other proposals, which has great potential for perceiving complete objects. We demonstrate the effectiveness of our instance-level feature refinement in Section 4.
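A hedged sketch of Equation (11); the softmax axes of $\rho$ are assumed to follow the two-stream convention of Equations (6)–(7), which is not stated explicitly above:

import torch

def isl_loss(r_ab, w4, w5, y, eps=1e-6):
    """Sketch of the ISL in Equation (11).
    r_ab: (K, D) salient features R^{AB}; w4, w5: nn.Linear(D, C); y: (C,) labels."""
    s_ab = torch.softmax(w4(r_ab), dim=1) * torch.softmax(w5(r_ab), dim=0)
    y_hat = s_ab.sum(dim=0).clamp(eps, 1 - eps)  # c-th class score over proposals
    return -(y * y_hat.log() + (1 - y) * (1 - y_hat).log()).sum()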

3.4. MIL Branch

We use a pseudo-fully supervised classification and regression method [16] to refine the final results. Since the predicted instance labels can be obtained by the method in [14], the standard bounding-box regression and category classification loss functions can be formulated as
$\mathcal{L}_{DET} = \mathcal{L}_{CLS} + \lambda \mathcal{L}_{LOC}, \quad (12)$
where $\mathcal{L}_{CLS}$ is the classification loss and $\mathcal{L}_{LOC}$ is the smooth $L_1$ loss. Moreover, our complete network is trained by optimizing the comprehensive loss function
$\mathcal{L} = \mathcal{L}_{MAML} + \mathcal{L}_{ISL} + \mathcal{L}_{W} + \mathcal{L}_{R} + \mathcal{L}_{DET}, \quad (13)$
where $\mathcal{L}_{W}$ and $\mathcal{L}_{R}$ are the multi-label classification (MLC) loss of WSDDN [14] and the refinement loss of OICR [15], respectively.
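Assembled in code, the overall objective is a plain sum; a minimal sketch (the weight $\lambda$ is not specified above, so the default is an assumption):

def total_loss(l_maml, l_isl, l_w, l_r, l_cls, l_loc, lam=1.0):
    """Sketch of Equations (12)-(13)."""
    l_det = l_cls + lam * l_loc                   # Equation (12)
    return l_maml + l_isl + l_w + l_r + l_det     # Equation (13)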

4. Experiments

4.1. Datasets and Evaluation Metrics

We evaluate our proposed method on the NWPU VHR-10.v2 https://drive.google.com/file/d/15xd4TASVAC2irRf02GA4LqYFbH7QITR-/view?usp=sharing (accessed on 20 March 2024) [22] and DIOR https://drive.google.com/drive/folders/1UdlgHk49iu6WpcJ5467iT-UqNPpx__CC (accessed on 20 March 2024) [47] datasets, which are the most frequently used benchmarks in RSI-based WSOD methods. In our experiments, we only adopt image-level annotations to train our model.
NWPU VHR-10.v2 [22] is modified from NWPU VHR-10 [48], which crops each image into 400 × 400 pixels. There are 1172 images in the NWPU VHR-10.v2 dataset. A total of 10 categories with 2775 instances are annotated in NWPU VHR-10.v2, namely, Airplane, Ship, Storage tank, Baseball diamond, Tennis court, Basketball court, Ground track field, Harbor, Bridge, and Vehicle. Following the standard protocols [21,22,23], our method uses 879 images for training and 293 images for testing.
DIOR is more challenging than NWPU VHR-10.v2 and contains 23,463 samples with a size of 800 × 800 pixels. A total of 20 categories with 192,518 instances are annotated in DIOR, namely, Airplane, Airport, Baseball field, Basketball court, Bridge, Chimney, Dam, Expressway service area, Expressway toll station, Golf field, Ground track field, Harbor, Overpass, Ship, Stadium, Storage tank, Tennis court, Train station, Vehicle, and Windmill. DIOR is divided into training, validation and testing sets. All the previous WSOD methods [14,15,20,21,22,23,24,37,38,43,49,50] use a combination of training and validation for training (11,725 images). The remaining 11,738 images in the testing set are used for testing. In order to facilitate fair comparisons, we also adopt the aforementioned settings.
The popular evaluation metrics for WSOD in RSIs are as follows: (1) we use the average precision (AP) and mean AP (mAP) metrics to evaluate the detection performance of our method; (2) correct localization (CorLoc) is another widely used WSOD evaluation metric, which measures the localization accuracy on the training set.
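For reference, CorLoc for one class counts the fraction of positive training images whose top-scoring prediction overlaps a ground-truth box with IoU ≥ 0.5; a minimal sketch (assuming per-image predictions are sorted by confidence):

import torch
from torchvision.ops import box_iou

def corloc(pred_boxes_per_image, gt_boxes_per_image, iou_thr=0.5):
    """Sketch of CorLoc for a single class over a set of training images."""
    hits = 0
    for pred, gt in zip(pred_boxes_per_image, gt_boxes_per_image):
        # pred, gt: (K, 4) tensors; pred[0] is the top-scoring box
        if len(pred) and len(gt) and box_iou(pred[:1], gt).max() >= iou_thr:
            hits += 1
    return hits / max(len(gt_boxes_per_image), 1)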

4.2. Implementation Details

In our experiments, only one GTX 1080 Ti GPU is used for training and testing. Following the standard settings in [24], we use a pre-trained VGG16 network as the base feature extractor. The selective search algorithm [18] is used to obtain roughly 2000 candidate proposals per image. For NWPU VHR-10.v2, the initial learning rate is set to 0.001 and a total of 30K iterations are used to train our model; the learning rate is decreased by a factor of 0.1 at iteration 20K. For DIOR, we train our model for 200K iterations with the learning rate decreased by a factor of 0.1 at iteration 100K. We set the batch size to 1 and adopt the stochastic gradient descent (SGD) optimizer. The momentum and weight decay are set to 0.9 and 0.0005, respectively.
During training, we use five image scales {480, 576, 688, 864 and 1200} and adopt three rotation transformations {90°, 180° and 270°} as data augmentation strategies. During testing, we set the confidence threshold to 0.3 for performing non-maximum suppression (NMS).
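These settings translate directly into an optimizer configuration; a sketch for the NWPU VHR-10.v2 schedule (the model argument is a placeholder for the full network, and the scheduler is assumed to be stepped once per iteration):

import torch

def build_optimizer(model):
    """Sketch of the NWPU VHR-10.v2 optimization schedule described above."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                                momentum=0.9, weight_decay=0.0005)
    # Decay the learning rate by 0.1 at iteration 20K (30K iterations in total)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[20000], gamma=0.1)
    return optimizer, scheduler

TRAIN_SCALES = [480, 576, 688, 864, 1200]  # multi-scale training
ROTATIONS = [90, 180, 270]                 # rotation augmentation (degrees)
NMS_CONFIDENCE = 0.3                       # test-time confidence threshold for NMS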

4.3. Performance Evaluation

In order to evaluate the effectiveness of the proposed method, we compare it with 12 WSOD approaches, including WSDDN [14], OICR [15], PCL [20], the min-entropy latent model (MELM) [49], DCL [22], PCIR [43], MIG [21], TCANet [24], SPG [23], the self-supervised adversarial and equivariant network (SAENet) [50], the weakly supervised oriented detector (WSODet) [37] and PSL [38]. Most of these approaches were published in the last two years. Following [37,38,50], we list the detection results reported in their published papers.

4.3.1. NWPU VHR-10.v2

As listed in Table 1 and Table 2, WSDDN [14], OICR [15], PCL [20] and MELM [49] are the most frequently used baseline methods for WSOD, which achieve 35.1%, 34.5%, 39.4% and 42.3% mAP, and 35.3%, 40.0%, 45.6% and 49.9% CorLoc, respectively. Although the detection performance of these methods is inferior, they have laid solid foundations for subsequent works. TCANet [24] crops the original image-level features into three sub-patches and models the interaction among these sub-features. However, re-balancing the computational resources towards the informative parts of images inevitably exacerbates the problem of part domination, and our proposed method achieves 6.4% mAP and 2.4% CorLoc gains over TCANet. Conversely, PCIR [43], MIG [21] and SPG [23] are all devoted to boosting the representation of instance-level features. However, due to the neglect of image-level feature refinement, it is difficult for these three approaches to locate small instances. The best competitor, PSL [38], constructs a proposal quality score to excavate pseudo-bounding boxes of instances, and semantic segmentation is employed to perceive the complete object. However, it is difficult for the proposed pseudo-proposal generation strategy to cover objects with small scales. By refining image-level features, our method achieves 1.4% mAP and 0.6% CorLoc gains over PSL. Importantly, we also list the mAPs of several fully supervised object detection (FSOD) methods in Table 1. It can be clearly observed that the proposed method significantly narrows the accuracy gap between WSOD and FSOD. We also visualize several detection results in Figure 4 to analyze the effectiveness of our proposed method. The baseline method is biased towards bounding boxes that cover the most representative parts of entire instances. Meanwhile, objects of the same category in crowded scenes are more easily mis-detected by the baseline approach. As illustrated in the second row of Figure 4, our method can accurately cover most instances, which demonstrates that boosting deep features at both image level and instance level has great potential for addressing the problems of part domination and missing detection for WSOD in RSIs.

4.3.2. DIOR

We list the detailed comparisons on the more challenging DIOR dataset in Table 3 and Table 4. Objects in DIOR are much smaller and more densely packed than those in NWPU VHR-10.v2. Therefore, perceiving more latent positive proposals (our image-level feature refinement branch) and modeling the instance-associative spatial correlation among instances (our instance-level feature refinement branch) have great potential for alleviating the problems of missing detection and part domination. The mAP and CorLoc evaluation criteria also prove the effectiveness of our approach. Specifically, the proposed method outperforms WSDDN [14], OICR [15], PCL [20], MELM [49], DCL [22], PCIR [43], MIG [21], TCANet [24], SPG [23], SAENet [50], WSODet [37] and PSL [38] by 15.8% (29.1% vs. 13.3%), 12.6% (29.1% vs. 16.5%), 10.9% (29.1% vs. 18.2%), 10.4% (29.1% vs. 18.7%), 8.9% (29.1% vs. 20.2%), 4.2% (29.1% vs. 24.9%), 4.0% (29.1% vs. 25.1%), 3.3% (29.1% vs. 25.8%), 3.3% (29.1% vs. 25.8%), 2.0% (29.1% vs. 27.1%), 1.8% (29.1% vs. 27.3%) and 0.5% (29.1% vs. 28.6%), which are notable margins in terms of mAP. Figure 5 illustrates detection comparisons of several approaches on the DIOR dataset. It can be clearly seen that the baseline method struggles to distinguish instances with similar appearances. Conversely, our method captures more complete objects and reduces the instance ambiguity induced by extreme scales and poses, as well as similar appearances.
Additionally, we further count the number of instances of each category in DIOR. As illustrated in Figure 6, most classes are approximately balanced between the training and testing sets. For five categories, the ratio of training to testing instances is approximately 1:2, and for two categories the ratio varies greatly. It can be clearly seen that the standard partitioning of the dataset indeed causes an unbalanced distribution of categories. This statistical overview inspires our subsequent work: based on the standard dataset partitioning principle, designing a category-balancing strategy is a potentially useful way to boost the performance of WSOD on the DIOR dataset.

4.4. Ablation Studies

We denote our proposed method with the image-level feature refinement branch and the instance-level feature refinement branch as IM and IN, respectively. Moreover, we further identify our proposed image-level and instance-level feature refinement branches with and without the corresponding multi-label attention mining loss (MAML) and instance selection loss (ISL) as w. MAML, w.o. MAML, w. ISL and w.o. ISL, respectively. Importantly, we introduce the rotation-invariant learning (RL) of MELM [49], which is defined as method A in Table 5 and Table 6.
For the NWPU VHR-10.v2 dataset, the mAP and CorLoc of six methods with different configurations are compared in Table 5. Method A is the baseline network, which achieves 56.3% mAP and 70.4% CorLoc. After integrating the proposed image-level feature refinement branch (without MAML) into the baseline method, method B achieves 61.4% mAP and 71.5% CorLoc, which is 5.1% and 1.1% higher than the baseline. However, with only image-level supervision, the attention mechanism tends to confuse instances with similar appearances. Method C is the combination of IM and MAML, which achieves 0.9% mAP and 1.1% CorLoc gains compared to method B. The aforementioned approaches focus on re-balancing the computational resources towards all the informative parts of images so that most positive proposals can be retained. Method D is the combination of the instance-level feature refinement branch (without ISL) and the baseline method, which aims to explore the instance-associative spatial correlation among instance regions to alleviate the problem of part domination. Specifically, method D achieves 61.3% mAP and 72.2% CorLoc, which is 5.0% and 1.8% higher than the baseline approach. After integrating ISL into IN, method E achieves 0.7% mAP and 0.9% CorLoc gains compared to method D. Importantly, the best performance is attained by refining deep features at both the image level (with MAML) and the instance level (with ISL), which achieves 65.2% mAP and 75.2% CorLoc.
The detailed comparisons on the DIOR dataset also validate the effectiveness of our method. Compared with the baseline network, integrating IM and IN into the baseline method achieves 0.9% mAP and 3.2% CorLoc, and 1.8% mAP and 4.3% CorLoc gains, respectively. After integrating MAML and ISL into the aforementioned feature refinement branches, method C and method E attain 26.7% mAP and 50.3% CorLoc, and 27.3% mAP and 49.1% CorLoc, which are 2.1% mAP and 4.0% CorLoc, and 2.7% mAP and 2.8% CorLoc higher than the baseline method. Importantly, method F obtains the best mAP (29.1%) and CorLoc (52.3%) on the DIOR dataset, which demonstrates that the combination of image-level and instance-level feature refinement is effective. We further visualize several detection results in Figure 7 and Figure 8. Compared with the baseline, the proposed image-level feature refinement branch successfully alleviates the problem of missing detection. However, allocating the computational resources towards the most informative parts of images inevitably exacerbates the problem of part domination. For example, airfoil and fuselage regions might contribute the most to detecting airplanes. However, using the airfoil or fuselage parts to represent the entire airplane is obviously problematic. To alleviate this limitation, our instance-level feature refinement branch explores instance-associative spatial diversification constraints, which has great potential for excavating complete instances.

4.5. Extension: Performance on Natural Images

In order to verify the validity of our proposed method, we further evaluate the proposed model on PASCAL VOC2007 http://host.robots.ox.ac.uk/pascal/VOC/voc2007/index.html (accessed on 20 March 2024) [12], PASCAL VOC2012 http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html (accessed on 20 March 2024) [12] and COCO https://cocodataset.org/#download (accessed on 20 March 2024) [13], which are the most popular datasets for natural image-based WSOD methods. Following the standard experimental settings [15,16], we only use image-level annotations for training.
We list the detailed comparisons on the PASCAL VOC2007 and 2012 datasets in Table 7. All the detection results are obtained from the original papers. Our proposed method achieves 57.5% mAP and 71.0% CorLoc, outperforming all the compared WSOD approaches. REG [16] uses an attention mechanism to boost the representation of image-level features but ignores modeling the correlation among different instances. Thus, our method outperforms REG by 8.9% mAP and 4.2% CorLoc, and 7.1% mAP and 1.8% CorLoc on the VOC2007 and 2012 datasets, respectively. Importantly, our proposed method outperforms the best two competitors, PSL [38] and Ren et al. [56], by a conspicuous margin. Compared with many other WSOD methods, our detection results on the VOC2007 and 2012 datasets demonstrate the effectiveness of refining deep features at both image level and instance level. Figure 9 also illustrates detection results of several approaches on the VOC dataset. It can be clearly seen that our method has great potential for capturing more complete objects and reducing instance ambiguity induced by extreme scales, as well as similar appearances.
Additionally, Table 8 shows the comparative experiments on the COCO dataset. The detection results are partially obtained from [17] and partially obtained by our re-implementation of REG [16]. To facilitate a fair comparison, we adopt the same settings as [17]. Compared with other single-model-based methods, our method achieves 26.9% AP50, which is 12.6% (26.9% vs. 14.3%), 10.7% (26.9% vs. 16.2%), 8.1% (26.9% vs. 18.8%), 7.5% (26.9% vs. 19.4%), 7.3% (26.9% vs. 19.6%), 6.6% (26.9% vs. 20.3%), 6.2% (26.9% vs. 20.7%) and 1.1% (26.9% vs. 25.8%) higher than OICR [15], ML-LocNet [63], MELM [49], PCL [20], REG [16], WS-JDS [64], PG + PS [17] and Ren et al. [56], respectively. For the other evaluation metrics on the COCO dataset, the proposed method also achieves the best AP75 (12.4%) and AP (12.9%), which are 1.9% and 0.5% higher than the best competitor, Ren et al. [56]. We also compare our method with other ensemble-learning-based approaches in Table 8. Following [14,15], we simply average the category scores predicted by our IM-integrated, IN-integrated and IM+IN-integrated models. Our ensemble achieves 27.8% AP50, 13.4% AP75 and 14.7% AP, which attains 0.9% AP50, 1.0% AP75 and 1.8% AP gains compared to our single-model-based method.

5. Conclusions

In this paper, we propose a progressive image-level and instance-level feature refinement network to boost the representation of deep features for weakly supervised object detection (WSOD) in RSIs. First, the image-level feature refinement branch is constructed to alleviate the problem of missing detection. With the supervision of the proposed multi-label attention mining loss (MAML), our method provides an explicit indicator to decrease the ambiguities in object recognition. Moreover, we further construct an instance-level feature refinement branch to model the region-associative spatial diversification for re-scoring the confidence of different adjacent bounding boxes. An instance selection loss (ISL) is proposed to excavate the salient RoI features for emphasizing the representation of instance-level features. The aforementioned feature refinement branches are unified into the complete WSOD network. The region classification and box regression tasks are optimized by our constructed MAML and ISL loss functions to boost the performance of WSOD in RSIs. The proposed method outperforms previous state-of-the-art approaches on the NWPU VHR-10.v2 and DIOR datasets, which demonstrates the effectiveness of boosting deep features at both image level and instance level for WSOD.

Author Contributions

Conceptualization, S.Z. and Z.W. (Zebin Wu); methodology, Z.W. (Zebin Wu); software, Y.X.; validation, S.Z. and Z.W. (Zebin Wu); formal analysis, Z.W. (Zhihui Wei); investigation, S.Z. and Y.X.; resources, Z.W. (Zebin Wu); data curation, S.Z., Y.X. and Z.W. (Zebin Wu); writing—original draft preparation, S.Z.; writing—review and editing, S.Z. and Z.W. (Zhihui Wei); visualization, S.Z. and Y.X.; supervision, Z.W. (Zebin Wu) and Z.W. (Zhihui Wei); project administration, Z.W. (Zebin Wu); funding acquisition, Z.W. (Zebin Wu). All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded in part by the National Natural Science Foundation of China under Grant U23B2006; in part by the Jiangsu Provincial Innovation Support Program under Grant BZ2023046; in part by the Jiangsu Provincial Key Research and Development Program under Grant BE2022065-2; in part by the National Natural Science Foundation of China under Grant 62071233; and in part by the Jiangsu Provincial Natural Science Foundation of China under Grant BK20211570.

Data Availability Statement

The datasets in this study are available online from https://drive.google.com/file/d/15xd4TASVAC2irRf02GA4LqYFbH7QITR-/view (accessed on 20 March 2024) and https://drive.google.com/drive/folders/1UdlgHk49iu6WpcJ5467iT-UqNPpx__CC (accessed on 20 March 2024).

Acknowledgments

The authors would like to thank NVIDIA Corporation for providing the GeForce GTX 1080 Ti used in this research.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Zhang, T.; Zhuang, Y.; Chen, H.; Chen, L.; Wang, G.; Gao, P.; Dong, H. Object-Centric Masked Image Modeling-Based Self-Supervised Pretraining for Remote Sensing Object Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 5013–5025. [Google Scholar] [CrossRef]
  2. Gao, L.; Li, J.; Zheng, K.; Jia, X. Enhanced Autoencoders With Attention-Embedded Degradation Learning for Unsupervised Hyperspectral Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5509417. [Google Scholar] [CrossRef]
  3. Gao, L.; Sun, X.; Sun, X.; Zhuang, L.; Du, Q.; Zhang, B. Hyperspectral anomaly detection based on chessboard topology. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5505016. [Google Scholar] [CrossRef]
  4. Su, Y.; Gao, L.; Jiang, M.; Plaza, A.; Sun, X.; Zhang, B. NSCKL: Normalized Spectral Clustering With Kernel-Based Learning for Semisupervised Hyperspectral Image Classification. IEEE Trans. Cybern. 2023, 53, 6649–6662. [Google Scholar] [CrossRef] [PubMed]
  5. Zhuang, L.; Ng, M.K.; Gao, L.; Michalski, J.; Wang, Z. Eigenimage2Eigenimage (E2E): A Self-Supervised Deep Learning Network for Hyperspectral Image Denoising. IEEE Trans. Neural Netw. Learn. Syst. 2023, 1–15. [Google Scholar] [CrossRef]
  6. Gao, L.; Wang, D.; Zhuang, L.; Sun, X.; Huang, M.; Plaza, A. BS3LNet: A new blind-spot self-supervised learning network for hyperspectral anomaly detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5504218. [Google Scholar] [CrossRef]
  7. Small, C.; Sousa, D. Robust Cloud Suppression and Anomaly Detection in Time-Lapse Thermography. Remote Sens. 2024, 16, 255. [Google Scholar] [CrossRef]
  8. Gui, S.; Song, S.; Qin, R.; Tang, Y. Remote Sensing Object Detection in the Deep Learning Era—A Review. Remote Sens. 2024, 16, 327. [Google Scholar] [CrossRef]
  9. Feng, Y.; Han, B.; Wang, X.; Shen, J.; Guan, X.; Ding, H. Self-Supervised Transformers for Unsupervised SAR Complex Interference Detection Using Canny Edge Detector. Remote Sens. 2024, 16, 306. [Google Scholar] [CrossRef]
  10. Zheng, S.; Wu, Z.; Xu, Y.; Wei, Z.; Plaza, A. Learning Orientation Information From Frequency-Domain for Oriented Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5628512. [Google Scholar] [CrossRef]
  11. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
  12. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  13. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar] [CrossRef]
  14. Bilen, H.; Vedaldi, A. Weakly supervised deep detection networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2846–2854. [Google Scholar] [CrossRef]
  15. Tang, P.; Wang, X.; Bai, X.; Liu, W. Multiple instance detection network with online instance classifier refinement. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2843–2851. [Google Scholar] [CrossRef]
  16. Yang, K.; Li, D.; Dou, Y. Towards precise end-to-end weakly supervised object detection network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 7 October–2 November 2019; pp. 8372–8381. [Google Scholar] [CrossRef]
  17. Cheng, G.; Yang, J.; Gao, D.; Guo, L.; Han, J. High-quality proposals for weakly supervised object detection. IEEE Trans. Image Process. 2020, 29, 5794–5804. [Google Scholar] [CrossRef]
  18. Uijlings, J.R.; Van De Sande, K.E.; Gevers, T.; Smeulders, A.W. Selective search for object recognition. Int. J. Comput. Vis. 2013, 104, 154–171. [Google Scholar] [CrossRef]
  19. Zitnick, C.L.; Dollár, P. Edge boxes: Locating object proposals from edges. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 391–405. [Google Scholar] [CrossRef]
  20. Tang, P.; Wang, X.; Bai, S.; Shen, W.; Bai, X.; Liu, W.; Yuille, A. Pcl: Proposal cluster learning for weakly supervised object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 42, 176–191. [Google Scholar] [CrossRef] [PubMed]
  21. Wang, B.; Zhao, Y.; Li, X. Multiple instance graph learning for weakly supervised remote sensing object detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5613112. [Google Scholar] [CrossRef]
  22. Yao, X.; Feng, X.; Han, J.; Cheng, G.; Guo, L. Automatic weakly supervised object detection from high spatial resolution remote sensing images via dynamic curriculum learning. IEEE Trans. Geosci. Remote Sens. 2020, 59, 675–685. [Google Scholar] [CrossRef]
  23. Cheng, G.; Xie, X.; Chen, W.; Feng, X.; Yao, X.; Han, J. Self-guided Proposal Generation for Weakly Supervised Object Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5625311. [Google Scholar] [CrossRef]
  24. Feng, X.; Han, J.; Yao, X.; Cheng, G. TCANet: Triple context-aware network for weakly supervised object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 6946–6955. [Google Scholar] [CrossRef]
  25. Fasana, C.; Pasini, S.; Milani, F.; Fraternali, P. Weakly supervised object detection for remote sensing images: A survey. Remote Sens. 2022, 14, 5362. [Google Scholar] [CrossRef]
  26. Choi, J.; Lee, S.J. Weakly Supervised Learning for Transmission Line Detection Using Unpaired Image-to-Image Translation. Remote Sens. 2022, 14, 3421. [Google Scholar] [CrossRef]
  27. Berg, P.; Santana Maia, D.; Pham, M.T.; Lefèvre, S. Weakly supervised detection of marine animals in high resolution aerial images. Remote Sens. 2022, 14, 339. [Google Scholar] [CrossRef]
  28. Wang, H.; Li, H.; Qian, W.; Diao, W.; Zhao, L.; Zhang, J.; Zhang, D. Dynamic pseudo-label generation for weakly supervised object detection in remote sensing images. Remote Sens. 2021, 13, 1461. [Google Scholar] [CrossRef]
  29. Foulds, J.; Frank, E. A review of multi-instance learning assumptions. Knowl. Eng. Rev. 2010, 25, 1–25. [Google Scholar] [CrossRef]
  30. Huang, Z.; Zou, Y.; Kumar, B.; Huang, D. Comprehensive attention self-distillation for weakly-supervised object detection. Adv. Neural Inf. Process. Syst. 2020, 33, 16797–16807. [Google Scholar]
  31. Lin, C.; Wang, S.; Xu, D.; Lu, Y.; Zhang, W. Object instance mining for weakly supervised object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11482–11489. [Google Scholar] [CrossRef]
  32. Han, J.; Zhang, D.; Cheng, G.; Guo, L.; Ren, J. Object detection in optical remote sensing images based on weakly supervised learning and high-level feature learning. IEEE Trans. Geosci. Remote Sens. 2014, 53, 3325–3337. [Google Scholar] [CrossRef]
  33. Sun, Y.; Ran, J.; Yang, F.; Gao, C.; Kurozumi, T.; Kimata, H.; Ye, Z. Oriented Object Detection For Remote Sensing Images Based On Weakly Supervised Learning. In Proceedings of the 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar] [CrossRef]
  34. Gungor, C.; Kovashka, A. Complementary Cues from Audio Help Combat Noise in Weakly-Supervised Object Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 2185–2194. [Google Scholar] [CrossRef]
  35. Zhu, M.; Wan, S.; Jin, P.; Zhang, P. DFFNet: Dynamic Feature Fusion Network for Weakly Supervised Object Detection in Remote Sensing Images. In Proceedings of the 2022 IEEE International Conference on Big Data (Big Data), Osaka, Japan, 17–20 December 2022; pp. 1409–1414. [Google Scholar] [CrossRef]
  36. Qian, W.; Yan, Z.; Zhu, Z.; Yin, W. Weakly Supervised Part-Based Method for Combined Object Detection in Remote Sensing Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 5024–5036. [Google Scholar] [CrossRef]
  37. Tan, Z.; Jiang, Z.; Guo, C.; Zhang, H. WSODet: A Weakly Supervised Oriented Detector for Aerial Object Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5604012. [Google Scholar] [CrossRef]
  38. Qian, X.; Huo, Y.; Cheng, G.; Gao, C.; Yao, X.; Wang, W. Mining High-Quality Pseudoinstance Soft Labels for Weakly Supervised Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5607615. [Google Scholar] [CrossRef]
  39. Feng, X.; Yao, X.; Cheng, G.; Han, J. Weakly supervised rotation-invariant aerial object detection network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14146–14155. [Google Scholar] [CrossRef]
  40. Ma, L.; Luo, X.; Hong, H.; Zhang, Y.; Wang, L.; Wu, J. Scribble-attention hierarchical network for weakly supervised salient object detection in optical remote sensing images. Appl. Intell. 2023, 53, 12999–13017. [Google Scholar] [CrossRef]
  41. Shamsolmoali, P.; Chanussot, J.; Zareapoor, M.; Zhou, H.; Yang, J. Multipatch feature pyramid network for weakly supervised object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5610113. [Google Scholar] [CrossRef]
  42. Xie, X.; Cheng, G.; Feng, X.; Yao, X.; Qian, X.; Han, J. Attention Erasing and Instance Sampling for Weakly Supervised Object Detection. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5600910. [Google Scholar] [CrossRef]
  43. Feng, X.; Han, J.; Yao, X.; Cheng, G. Progressive contextual instance refinement for weakly supervised object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8002–8012. [Google Scholar] [CrossRef]
  44. Huo, Y.; Qian, X.; Li, C.; Wang, W. Multiple Instance Complementary Detection and Difficulty Evaluation for Weakly Supervised Object Detection in Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6006505. [Google Scholar] [CrossRef]
  45. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, Montreal, QC, Canada, 7–12 December 2015. [Google Scholar] [CrossRef] [PubMed]
  46. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  47. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  48. Li, K.; Cheng, G.; Bu, S.; You, X. Rotation-Insensitive and Context-Augmented Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2337–2348. [Google Scholar] [CrossRef]
  49. Wan, F.; Wei, P.; Jiao, J.; Han, Z.; Ye, Q. Min-entropy latent model for weakly supervised object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1297–1306. [Google Scholar] [CrossRef]
  50. Feng, X.; Yao, X.; Cheng, G.; Han, J.; Han, J. SAENet: Self-Supervised Adversarial and Equivariant Network for Weakly Supervised Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5610411. [Google Scholar] [CrossRef]
  51. Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132. [Google Scholar] [CrossRef]
  52. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  53. Cheng, G.; Zhou, P.; Han, J. Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 7405–7415. [Google Scholar] [CrossRef]
  54. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar] [CrossRef]
  55. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
  56. Ren, Z.; Yu, Z.; Yang, X.; Liu, M.Y.; Lee, Y.J.; Schwing, A.G.; Kautz, J. Instance-aware, context-focused, and memory-efficient weakly supervised object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10598–10607. [Google Scholar] [CrossRef]
  57. Ji, R.; Liu, Z.; Zhang, L.; Liu, J.; Zuo, X.; Wu, Y.; Zhao, C.; Wang, H.; Yang, L. Multi-peak Graph-based Multi-instance Learning for Weakly Supervised Object Detection. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2021, 17, 70. [Google Scholar] [CrossRef]
  58. Gao, W.; Wan, F.; Yue, J.; Xu, S.; Ye, Q. Discrepant multiple instance learning for weakly supervised object detection. Pattern Recognit. 2022, 122, 108233. [Google Scholar] [CrossRef]
  59. Xu, Y.; Zhou, C.; Yu, X.; Xiao, B.; Yang, Y. Pyramidal Multiple Instance Detection Network With Mask Guided Self-Correction for Weakly Supervised Object Detection. IEEE Trans. Image Process. 2021, 30, 3029–3040. [Google Scholar] [CrossRef]
  60. Yin, Y.; Deng, J.; Zhou, W.; Li, H. Instance mining with class feature banks for weakly supervised object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 3190–3198. [Google Scholar] [CrossRef]
  61. Jia, Q.; Wei, S.; Ruan, T.; Zhao, Y.; Zhao, Y. GradingNet: Towards providing reliable supervisions for weakly supervised object detection by grading the box candidates. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 1682–1690. [Google Scholar] [CrossRef]
  62. Li, X.; Yi, S.; Zhang, R.; Fu, X.; Jiang, H.; Wang, C.; Liu, Z.; Gao, J.; Yu, J.; Yu, M.; et al. Dynamic sample weighting for weakly supervised object detection. Image Vis. Comput. 2022, 122, 104444. [Google Scholar] [CrossRef]
  63. Zhang, X.; Yang, Y.; Feng, J. ML-LocNet: Improving object localization with multi-view learning network. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 240–255. [Google Scholar] [CrossRef]
  64. Shen, Y.; Ji, R.; Wang, Y.; Wu, Y.; Cao, L. Cyclic guidance for weakly supervised joint detection and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 697–707. [Google Scholar] [CrossRef]
Figure 1. Visualizations of boosting deep features at the image level and instance level. The orange bounding boxes show that refining features at the image level captures the most discriminative regions within an object (e.g., airfoil and fuselage). The red bounding boxes show that refining instance-level features has great potential to perceive the full instance.
Figure 2. The overview of our proposed method. First, a pre-trained VGG16 backbone network extracts the image features F. Then, an image-level feature refinement branch progressively enhances the representation of the backbone features. After feeding the enhanced image-level features F_AM into the RoI pooling layer, we obtain the original instance-level RoI features R. All RoI features are fed into our instance-level feature refinement branch, and a set of salient RoIs and general RoIs is obtained via the proposed salient RoI feature selection (SRFS). Finally, the enhanced (E.) instance-level RoI features R_e are used to attain the proposal classification results of our WSOD method. Sum., Ave., Concat, Conv and FC denote summation, average, concatenation, convolution and fully connected layers, respectively. W_query, W_key and W_value denote the linear projection layers for the query, key and value, respectively.
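The image-level refinement branch in Figure 2 is, at its core, a self-attention operation over the backbone feature map. The PyTorch sketch below illustrates one plausible realization under the Figure 2 notation (W_query, W_key, W_value, and a residual Sum.); the module name, channel sizes, and the single-head formulation are our own illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ImageLevelRefinement(nn.Module):
    """Minimal self-attention sketch of image-level feature refinement.

    Follows the Figure 2 notation (W_query/W_key/W_value linear layers);
    the layer sizes and residual fusion are illustrative assumptions.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.w_query = nn.Linear(channels, channels)
        self.w_key = nn.Linear(channels, channels)
        self.w_value = nn.Linear(channels, channels)
        self.scale = channels ** -0.5

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: backbone features of shape (B, C, H, W)
        b, c, h, w = f.shape
        x = f.flatten(2).transpose(1, 2)                   # (B, H*W, C)
        q, k, v = self.w_query(x), self.w_key(x), self.w_value(x)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        refined = attn @ v                                 # (B, H*W, C)
        refined = refined.transpose(1, 2).reshape(b, c, h, w)
        return f + refined                                 # residual Sum. -> F_AM
```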
Figure 3. The overview of our salient RoI feature selection (SRFS). The set R_A contains the regions most likely to cover an informative image fragment of a certain class, as determined by the predicted category score vectors. All RoIs whose IoU with these regions is greater than 0.5 are placed in set R_B. Finally, we randomly pick from the remaining RoIs (R_C) to supplement the salient RoI feature set.
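As a complement to Figure 3, the sketch below shows one way the R_A/R_B/R_C grouping could be implemented. The function name, the size of the random supplement, and the use of torchvision's box_iou are our assumptions, not the paper's exact procedure.

```python
import torch
from torchvision.ops import box_iou

def select_salient_rois(boxes, scores, image_labels, iou_thr=0.5, num_extra=16):
    """Hedged sketch of salient RoI feature selection (SRFS, Figure 3).

    boxes:        (N, 4) RoI coordinates in (x1, y1, x2, y2) format
    scores:       (N, C) predicted category score vectors
    image_labels: list of class indices present at image level
    Returns indices of the salient RoIs (R_A u R_B u R_C).
    """
    # R_A: top-scoring RoI for each image-level class
    r_a = torch.stack([scores[:, c].argmax() for c in image_labels])
    # R_B: RoIs overlapping a salient seed with IoU > 0.5
    ious = box_iou(boxes, boxes[r_a])                        # (N, |R_A|)
    r_b = torch.nonzero(ious.max(dim=1).values > iou_thr).squeeze(1)
    chosen = torch.unique(torch.cat([r_a, r_b]))
    # R_C: random supplement drawn from the remaining RoIs
    rest = [i for i in range(len(boxes)) if i not in set(chosen.tolist())]
    if rest:
        rest = torch.tensor(rest)
        extra = rest[torch.randperm(len(rest))[:num_extra]]
        chosen = torch.cat([chosen, extra])
    return chosen
```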
Figure 4. Visualizations of different methods on NWPU VHR-10.v2. The first row shows the detection results of the baseline method and the second row shows the detection results of our proposed method. Taking the ground-truth visualizations in the third row as a reference, we can clearly see that our method accurately detects most of the instances.
Figure 5. Visualizations of different methods on DIOR. The first row shows the detection results of the baseline method and the second row shows the detection results of our proposed method. Taking the ground-truth visualizations in the third row as a reference, we can clearly see that our method accurately detects most of the instances.
Figure 6. The numbers of different categories in the training and testing sets.
Figure 7. Visualization results of different methods. Compared with the baseline (first row), refining deep features at the image level (second row) alleviates the problem of missing detection and reduces ambiguities in object recognition.
Figure 8. Visualization results of different methods. Compared with the baseline (first row), refining deep features at the instance level (second row) alleviates the problem of part domination and has great potential for perceiving complete instances.
Figure 9. Visualizations of different methods on the VOC2007 dataset. The first row shows the detection results of the baseline method and the second row shows the detection results of our proposed method. Taking the ground-truth visualizations in the third row as a reference, we can clearly see that our method accurately detects most of the instances.
Table 1. Average precision (%) for different methods on the NWPU VHR-10.v2 testing set.

| Methods | Airplane | Ship | Storage Tank | Baseball Diamond | Tennis Court | Basketball Court | Ground Track Field | Harbor | Bridge | Vehicle | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Fully supervised: | | | | | | | | | | | |
| COPD [51] | 62.3 | 69.4 | 64.5 | 82.1 | 34.1 | 35.3 | 84.2 | 56.3 | 16.4 | 44.3 | 54.9 |
| Transferred CNN [52] | 66.0 | 57.1 | 85.0 | 80.9 | 35.1 | 45.5 | 79.4 | 62.6 | 43.2 | 41.3 | 59.6 |
| RICNN [53] | 88.7 | 78.3 | 86.3 | 89.1 | 42.3 | 56.9 | 87.7 | 67.5 | 62.3 | 72.0 | 73.1 |
| RCNN [54] | 85.4 | 88.9 | 62.8 | 19.7 | 90.7 | 58.2 | 68.0 | 79.9 | 54.2 | 49.9 | 65.8 |
| Fast RCNN [55] | 90.9 | 90.6 | 89.3 | 47.3 | 100.0 | 85.9 | 84.9 | 88.2 | 80.3 | 69.8 | 82.7 |
| Faster RCNN [45] | 90.9 | 86.3 | 90.5 | 98.2 | 89.7 | 69.6 | 100.0 | 80.1 | 61.5 | 78.1 | 84.5 |
| RICO [48] | 99.7 | 90.8 | 90.6 | 92.9 | 90.3 | 80.1 | 90.8 | 80.3 | 68.5 | 87.1 | 87.1 |
| Weakly supervised: | | | | | | | | | | | |
| WSDDN [14] | 30.1 | 41.7 | 35.0 | 88.9 | 12.9 | 23.9 | 99.4 | 13.9 | 1.9 | 3.6 | 35.1 |
| OICR [15] | 13.7 | 67.4 | 57.2 | 55.2 | 13.6 | 39.7 | 92.8 | 0.2 | 1.8 | 3.7 | 34.5 |
| PCL [20] | 26.0 | 63.8 | 2.5 | 89.8 | 64.5 | 76.1 | 77.9 | 0.0 | 1.3 | 15.7 | 39.4 |
| MELM [49] | 80.9 | 69.3 | 10.5 | 90.2 | 12.8 | 20.1 | 99.2 | 17.1 | 14.2 | 8.7 | 42.3 |
| DCL [22] | 72.7 | 74.3 | 37.1 | 82.6 | 36.9 | 42.3 | 84.0 | 39.6 | 16.8 | 35.0 | 52.1 |
| DPLG [28] | 80.9 | 10.5 | 90.1 | 64.4 | 69.1 | 80.2 | 8.7 | 14.0 | 39.6 | 78.3 | 53.6 |
| PCIR [43] | 90.8 | 78.8 | 36.4 | 90.8 | 22.6 | 52.2 | 88.5 | 42.4 | 11.7 | 35.5 | 55.0 |
| MIG [21] | 88.7 | 71.6 | 75.2 | 94.2 | 37.6 | 47.7 | 100.0 | 27.3 | 8.3 | 9.1 | 56.0 |
| TCANet [24] | 89.4 | 78.2 | 78.4 | 90.8 | 35.3 | 50.4 | 90.9 | 42.4 | 4.1 | 28.3 | 58.8 |
| SAENet [50] | 82.9 | 74.5 | 50.2 | 96.7 | 55.7 | 72.9 | 100.0 | 36.5 | 6.3 | 31.9 | 60.7 |
| WSODet [37] | 95.3 | 75.6 | 81.9 | 98.0 | 20.9 | 56.7 | 100.0 | 29.8 | 5.1 | 48.1 | 61.3 |
| SPG [23] | 90.4 | 81.0 | 59.5 | 92.3 | 35.6 | 51.4 | 99.9 | 58.7 | 17.0 | 43.0 | 62.8 |
| PSL [38] | 87.6 | 80.1 | 57.3 | 94.0 | 36.4 | 80.4 | 100.0 | 56.9 | 9.8 | 35.6 | 63.8 |
| Ours | 90.8 | 81.6 | 56.6 | 91.7 | 51.9 | 69.5 | 100.0 | 53.4 | 16.3 | 40.5 | 65.2 |
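For reference, the per-class AP values in Table 1 follow the standard PASCAL VOC protocol: detections are ranked by confidence, matched to ground truth at IoU ≥ 0.5, and precision is integrated over recall. A minimal NumPy sketch of the 11-point interpolated AP is given below; it assumes the true/false-positive matching has been performed beforehand, and the function name is ours.

```python
import numpy as np

def voc_ap_11point(scores, is_tp, num_gt):
    """11-point interpolated AP for one class (PASCAL VOC 2007 style).

    scores: (N,) detection confidences for this class.
    is_tp:  (N,) bool array; detection i matched an unmatched ground-truth
            box of this class with IoU >= 0.5 (matching done beforehand).
    num_gt: number of ground-truth boxes of this class in the test set.
    """
    order = np.argsort(-scores)
    tp = np.cumsum(is_tp[order].astype(np.float64))
    fp = np.cumsum((~is_tp[order]).astype(np.float64))
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    ap = 0.0
    for r in np.arange(0.0, 1.1, 0.1):      # 11 recall thresholds
        mask = recall >= r
        ap += precision[mask].max() if mask.any() else 0.0
    return 100.0 * ap / 11.0                # percentage, as in Table 1
```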
Table 2. Correct localization (%) for different methods on the NWPU VHR-10.v2 trainval set.

| Methods | Airplane | Ship | Storage Tank | Baseball Diamond | Tennis Court | Basketball Court | Ground Track Field | Harbor | Bridge | Vehicle | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| WSDDN [14] | 22.3 | 36.8 | 40.0 | 92.5 | 18.0 | 24.2 | 99.3 | 14.8 | 1.7 | 2.9 | 35.3 |
| OICR [15] | 29.4 | 83.3 | 20.5 | 81.8 | 40.9 | 32.1 | 86.6 | 7.4 | 3.7 | 14.4 | 40.0 |
| PCL [20] | 11.8 | 50.0 | 12.8 | 98.7 | 84.5 | 77.4 | 90.7 | 0.0 | 9.3 | 15.6 | 45.6 |
| MELM [49] | 86.0 | 77.4 | 21.4 | 98.3 | 10.7 | 43.5 | 95.0 | 40.0 | 11.8 | 14.6 | 49.9 |
| DPLG [28] | 87.2 | 16.8 | 96.1 | 75.1 | 73.2 | 86.3 | 16.3 | 18.7 | 46.7 | 85.1 | 61.5 |
| PCIR [43] | 100.0 | 93.1 | 64.1 | 99.3 | 64.8 | 79.3 | 89.7 | 63.0 | 13.3 | 52.2 | 71.9 |
| MIG [21] | 97.8 | 90.3 | 87.2 | 98.7 | 54.9 | 64.2 | 100.0 | 74.1 | 13.0 | 21.6 | 70.2 |
| TCANet [24] | 96.9 | 91.8 | 95.1 | 88.7 | 66.9 | 62.8 | 96.0 | 54.2 | 19.6 | 55.6 | 72.8 |
| SAENet [50] | 97.1 | 91.7 | 87.8 | 98.7 | 40.9 | 81.1 | 100.0 | 70.4 | 14.8 | 52.2 | 73.5 |
| WSODet [37] | 96.7 | 93.2 | 82.5 | 99.5 | 40.5 | 58.0 | 100.0 | 67.7 | 9.7 | 73.3 | 72.4 |
| SPG [23] | 98.1 | 92.7 | 70.1 | 99.7 | 51.9 | 80.1 | 96.2 | 72.4 | 13.0 | 60.0 | 73.4 |
| PSL [38] | 94.4 | 86.6 | 68.5 | 97.8 | 69.8 | 87.5 | 100.0 | 68.6 | 16.0 | 56.6 | 74.6 |
| Ours | 98.6 | 94.7 | 76.4 | 83.9 | 61.3 | 82.4 | 100.0 | 78.1 | 19.5 | 57.2 | 75.2 |
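Correct localization (CorLoc), reported in Table 2, is the percentage of positive images in which the single top-scoring box for a class overlaps some ground-truth box of that class with IoU ≥ 0.5. A small self-contained sketch follows; the helper names and box format are our own assumptions.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def corloc(top_boxes, gt_boxes_per_image, thr=0.5):
    """CorLoc for one class: fraction of positive images whose single
    top-scoring predicted box overlaps a ground-truth box with IoU >= thr.

    top_boxes[i]          -- highest-scoring box for the class in image i
    gt_boxes_per_image[i] -- list of that image's ground-truth boxes
    """
    hits = sum(
        any(iou(pred, gt) >= thr for gt in gts)
        for pred, gts in zip(top_boxes, gt_boxes_per_image)
    )
    return 100.0 * hits / max(len(top_boxes), 1)
```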
Table 3. Average precision (%) for different methods on the DIOR testing set.

| Methods | Airplane | Airport | Baseball Field | Basketball Court | Bridge | Chimney | Dam | Expressway Service Area | Expressway Toll Station | Golf Field | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| WSDDN [14] | 9.1 | 39.7 | 37.8 | 20.2 | 0.3 | 12.2 | 0.6 | 0.7 | 11.9 | 4.9 | 13.3 |
| OICR [15] | 8.7 | 28.3 | 44.1 | 18.2 | 1.3 | 20.2 | 0.1 | 0.7 | 29.9 | 13.8 | 16.5 |
| PCL [20] | 21.5 | 35.2 | 59.8 | 23.5 | 3.0 | 43.7 | 0.1 | 0.9 | 1.5 | 2.9 | 18.2 |
| MELM [49] | 28.1 | 3.2 | 62.5 | 28.7 | 0.1 | 62.5 | 0.2 | 13.1 | 28.4 | 15.2 | 18.7 |
| DCL [22] | 20.9 | 22.7 | 54.2 | 11.5 | 6.0 | 61.0 | 0.1 | 1.1 | 31.0 | 30.9 | 20.2 |
| PCIR [43] | 30.4 | 36.1 | 54.2 | 26.6 | 9.1 | 58.6 | 0.2 | 9.7 | 36.2 | 32.6 | 24.9 |
| MIG [21] | 22.2 | 52.6 | 62.8 | 25.8 | 8.5 | 67.4 | 0.7 | 8.9 | 28.7 | 57.3 | 25.1 |
| TCANet [24] | 25.1 | 30.8 | 62.9 | 40.0 | 4.1 | 67.8 | 8.1 | 23.8 | 29.9 | 22.3 | 25.8 |
| SPG [23] | 31.3 | 36.7 | 62.8 | 29.1 | 6.1 | 62.7 | 0.3 | 15.0 | 30.1 | 35.0 | 25.8 |
| SAENet [50] | 20.6 | 62.7 | 62.7 | 23.5 | 7.6 | 64.6 | 0.2 | 34.5 | 30.6 | 55.4 | 27.1 |
| WSODet [37] | 32.2 | 53.3 | 66.5 | 76.6 | 0.1 | 57.1 | 0.1 | 0.1 | 0.4 | 42.8 | 27.3 |
| PSL [38] | 29.1 | 49.8 | 70.9 | 41.4 | 7.2 | 45.5 | 0.2 | 35.4 | 36.8 | 60.8 | 28.6 |
| Ours | 32.9 | 70.5 | 63.2 | 45.7 | 0.2 | 69.7 | 0.2 | 12.4 | 39.4 | 56.4 | 29.1 |

| Methods | Ground Track Field | Harbor | Overpass | Ship | Stadium | Storage Tank | Tennis Court | Train Station | Vehicle | Windmill | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| WSDDN [14] | 42.4 | 4.7 | 1.1 | 0.7 | 63.0 | 4.0 | 6.1 | 0.5 | 4.6 | 1.1 | 13.3 |
| OICR [15] | 57.4 | 10.7 | 11.1 | 9.1 | 59.3 | 7.1 | 0.7 | 0.1 | 9.1 | 0.4 | 16.5 |
| PCL [20] | 56.4 | 16.8 | 11.1 | 9.1 | 57.6 | 9.1 | 2.5 | 0.1 | 4.6 | 4.6 | 18.2 |
| MELM [49] | 41.1 | 26.1 | 0.4 | 9.1 | 8.6 | 15.0 | 20.6 | 9.8 | 0.1 | 0.5 | 18.7 |
| DCL [22] | 56.5 | 5.1 | 2.7 | 9.1 | 63.7 | 9.1 | 10.4 | 0.0 | 7.3 | 0.8 | 20.2 |
| PCIR [43] | 58.5 | 8.6 | 21.6 | 12.1 | 64.3 | 9.1 | 13.6 | 0.3 | 9.1 | 7.5 | 24.9 |
| MIG [21] | 47.7 | 23.8 | 0.8 | 6.4 | 54.1 | 13.2 | 4.1 | 14.8 | 0.2 | 2.4 | 25.1 |
| TCANet [24] | 53.9 | 24.8 | 11.1 | 9.1 | 46.4 | 13.7 | 31.0 | 1.5 | 9.1 | 1.0 | 25.8 |
| SPG [23] | 48.0 | 27.1 | 12.0 | 10.0 | 60.0 | 15.1 | 21.0 | 9.9 | 3.2 | 0.1 | 25.8 |
| SAENet [50] | 52.7 | 17.6 | 6.9 | 9.1 | 51.6 | 15.4 | 1.7 | 14.4 | 1.4 | 9.2 | 27.1 |
| WSODet [37] | 66.6 | 0.1 | 1.9 | 2.0 | 52.6 | 22.4 | 68.8 | 0.2 | 1.2 | 0.3 | 27.3 |
| PSL [38] | 48.5 | 14.0 | 25.1 | 18.5 | 48.9 | 11.7 | 11.9 | 3.5 | 11.3 | 1.7 | 28.6 |
| Ours | 55.3 | 16.6 | 0.6 | 9.1 | 54.8 | 18.1 | 11.0 | 16.1 | 9.1 | 1.1 | 29.1 |
Table 4. Correct localization (%) for different methods on the DIOR trainval set.

| Methods | Airplane | Airport | Baseball Field | Basketball Court | Bridge | Chimney | Dam | Expressway Service Area | Expressway Toll Station | Golf Field | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| WSDDN [14] | 5.7 | 59.9 | 94.2 | 55.9 | 4.9 | 23.4 | 1.0 | 6.8 | 44.5 | 12.8 | 32.4 |
| OICR [15] | 16.0 | 51.5 | 94.8 | 55.8 | 3.6 | 23.9 | 0.0 | 4.8 | 56.7 | 22.4 | 34.8 |
| PCL [20] | 61.1 | 46.9 | 95.4 | 63.6 | 7.3 | 95.1 | 0.2 | 5.7 | 5.1 | 50.8 | 41.5 |
| MELM [49] | 77.0 | 28.9 | 92.7 | 63.0 | 13.0 | 90.1 | 0.2 | 17.0 | 37.9 | 44.6 | 43.3 |
| PCIR [43] | 81.6 | 51.3 | 96.2 | 73.5 | 5.0 | 94.7 | 15.9 | 32.8 | 46.0 | 48.6 | 48.4 |
| MIG [21] | 77.0 | 46.9 | 95.4 | 63.6 | 23.0 | 95.1 | 0.2 | 17.0 | 57.9 | 50.8 | 46.8 |
| TCANet [24] | 91.2 | 69.4 | 95.5 | 67.5 | 18.9 | 97.8 | 0.2 | 70.5 | 54.3 | 51.4 | 49.4 |
| SPG [23] | 80.5 | 32.0 | 98.7 | 65.0 | 15.2 | 96.1 | 22.5 | 17.0 | 46.1 | 51.0 | 48.3 |
| SAENet [50] | 91.2 | 69.4 | 95.5 | 67.5 | 18.9 | 97.8 | 0.2 | 70.5 | 54.3 | 51.4 | 49.4 |
| WSODet [37] | 95.2 | 81.0 | 95.7 | 88.0 | 5.9 | 94.1 | 1.4 | 1.1 | 3.7 | 92.1 | 49.5 |
| Ours | 88.3 | 69.1 | 98.8 | 69.4 | 19.9 | 97.8 | 0.3 | 24.7 | 56.2 | 54.4 | 52.3 |

| Methods | Ground Track Field | Harbor | Overpass | Ship | Stadium | Storage Tank | Tennis Court | Train Station | Vehicle | Windmill | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| WSDDN [14] | 89.9 | 5.5 | 10.0 | 23.0 | 98.5 | 79.6 | 15.1 | 3.5 | 11.6 | 3.2 | 32.4 |
| OICR [15] | 91.4 | 18.2 | 18.7 | 31.8 | 98.3 | 81.3 | 7.5 | 1.2 | 15.8 | 2.0 | 34.8 |
| PCL [20] | 89.4 | 42.1 | 19.8 | 37.9 | 97.9 | 80.7 | 13.8 | 0.2 | 10.5 | 6.9 | 41.5 |
| MELM [49] | 88.1 | 49.4 | 15.7 | 28.2 | 98.3 | 83.0 | 22.8 | 10.3 | 4.6 | 2.2 | 43.3 |
| PCIR [43] | 85.3 | 38.9 | 20.2 | 30.6 | 84.6 | 91.5 | 56.3 | 3.8 | 10.5 | 1.3 | 48.4 |
| MIG [21] | 89.4 | 42.1 | 19.8 | 37.9 | 97.9 | 80.7 | 13.8 | 10.3 | 10.5 | 6.9 | 46.8 |
| TCANet [24] | 88.3 | 48.0 | 2.3 | 33.6 | 14.1 | 83.4 | 65.6 | 19.9 | 16.4 | 2.9 | 49.4 |
| SPG [23] | 89.2 | 49.5 | 22.0 | 35.2 | 98.6 | 90.0 | 32.6 | 12.7 | 10.0 | 2.3 | 48.3 |
| SAENet [50] | 88.3 | 48.0 | 2.3 | 33.6 | 14.1 | 83.4 | 65.6 | 19.9 | 16.4 | 2.9 | 49.4 |
| WSODet [37] | 95.3 | 1.4 | 13.6 | 43.1 | 95.9 | 90.7 | 89.4 | 0.2 | 17.1 | 5.7 | 49.5 |
| Ours | 89.6 | 49.1 | 19.4 | 34.5 | 96.7 | 84.7 | 63.2 | 15.6 | 11.6 | 3.4 | 52.3 |
Table 5. Mean average precision (%) and correct localization (%) for different methods on NWPU VHR-10.v2.

| Methods | IM: w/o MAML | IM: w/ MAML | IN: w/o ISL | IN: w/ ISL | mAP | CorLoc |
|---|---|---|---|---|---|---|
| A | | | | | 56.3 | 70.4 |
| B | | | | | 61.4 | 71.5 |
| C | | | | | 62.3 | 72.6 |
| D | | | | | 61.3 | 72.2 |
| E | | | | | 62.0 | 73.1 |
| F | | | | | 65.2 | 75.2 |
Table 6. Mean average precision (%) and correct localization (%) for different methods on DIOR.

| Methods | IM: w/o MAML | IM: w/ MAML | IN: w/o ISL | IN: w/ ISL | mAP | CorLoc |
|---|---|---|---|---|---|---|
| A | | | | | 24.6 | 46.3 |
| B | | | | | 25.5 | 49.5 |
| C | | | | | 26.7 | 50.3 |
| D | | | | | 26.4 | 50.6 |
| E | | | | | 27.3 | 49.1 |
| F | | | | | 29.1 | 52.3 |
Table 7. Mean average precision (%) and correct localization (%) for different WSOD methods on the VOC2007 and VOC2012 datasets.

| Method | 07-AP50 | 07-Cor | 12-AP50 | 12-Cor |
|---|---|---|---|---|
| REG [16] | 48.6 | 66.8 | 46.8 | 69.5 |
| MPG-MIL [57] | 50.4 | 67.1 | 46.9 | 67.4 |
| PG + PS [17] | 51.1 | 69.2 | 48.3 | 68.7 |
| D-MIL [58] | 53.5 | 68.7 | 49.6 | 70.1 |
| Xu et al. [59] | 53.9 | 69.8 | 52.8 | 73.3 |
| CFB [60] | 54.3 | 70.7 | 49.4 | 69.6 |
| BGM + IBM [61] | 54.3 | 72.1 | 50.5 | 71.9 |
| DSW [62] | 54.7 | 73.3 | 53.8 | 69.6 |
| Ren et al. [56] | 54.9 | 68.8 | 52.1 | 70.9 |
| PSL [38] | 56.3 | 70.3 | – | – |
| Ours | 57.5 | 71.0 | 53.9 | 71.3 |
Table 8. Average precision (%) for different single-model-based and ensemble-based methods on the COCO dataset.

| Methods | Single: AP50 | Single: AP75 | Single: AP | Ensemble: AP50 | Ensemble: AP75 | Ensemble: AP |
|---|---|---|---|---|---|---|
| OICR [15] | 14.3 | 6.1 | 7.7 | 16.1 | 8.1 | 8.4 |
| ML-LocNet [63] | 16.2 | – | – | – | – | – |
| MELM [49] | 18.8 | 7.6 | 7.8 | 20.7 | 9.1 | 9.7 |
| PCL [20] | 19.4 | 7.3 | 8.5 | 19.6 | – | 9.2 |
| REG [16] | 19.6 | 8.1 | 9.3 | 21.2 | 9.7 | 10.8 |
| WS-JDS [64] | 20.3 | – | 10.5 | – | – | – |
| PG + PS [17] | 20.7 | – | – | – | – | – |
| Ren et al. [56] | 25.8 | 10.5 | 12.4 | – | – | – |
| Ours | 26.9 | 12.4 | 12.9 | 27.8 | 13.4 | 14.7 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
