Article

Bayes R-CNN: An Uncertainty-Aware Bayesian Approach to Object Detection in Remote Sensing Imagery for Enhanced Scene Interpretation

1 Department of AI and Robotics, Sejong University, Seoul 05006, Republic of Korea
2 Department of Computer Science and Engineering, Sejong University, Seoul 05006, Republic of Korea
3 Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315201, China
4 School of Computer Science, Peking University, Beijing 100871, China
5 Department of Robotics and Artificial Intelligence (R&AI), School of Mechanical and Manufacturing Engineering (SMME), National University of Sciences and Technology (NUST), Islamabad H-12, Pakistan
6 Department of Electronic Engineering, Hanyang University, Seoul 04763, Republic of Korea
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2024, 16(13), 2405; https://doi.org/10.3390/rs16132405
Submission received: 25 April 2024 / Revised: 25 June 2024 / Accepted: 28 June 2024 / Published: 30 June 2024

Abstract:
Remote sensing technology has been modernized by artificial intelligence, which has made it possible for deep learning algorithms to extract useful information from images. However, overfitting and lack of uncertainty quantification, high-resolution images, information loss in traditional feature extraction, and background information retrieval for detected objects limit the use of deep learning models in various remote sensing applications. This paper proposes a Bayes by backpropagation (BBB)-based system for scene-driven identification and information retrieval in order to overcome the above-mentioned problems. We present the Bayes R-CNN, a two-stage object detection technique to reduce overfitting while also quantifying uncertainty for each object recognized within a given image. To extract features more successfully, we replace the traditional feature extraction model with our novel Multi-Resolution Extraction Network (MRENet) model. We propose the multi-level feature fusion module (MLFFM) in the inner lateral connection and a Bayesian Distributed Lightweight Attention Module (BDLAM) to reduce information loss in the feature pyramid network (FPN). In addition, our system incorporates a Bayesian image super-resolution model which enhances the quality of the image to improve the prediction accuracy of the Bayes R-CNN. Notably, MRENet is used to classify the background of the detected objects to provide detailed interpretation of the object. Our proposed system is comprehensively trained and assessed utilizing the state-of-the-art DIOR and HRSC2016 datasets. The results demonstrate our system’s ability to detect and retrieve information from remote sensing scene images.

1. Introduction

Remote sensing technology, a vital tool for observing and comprehending our environment, has changed significantly over the years. Driven by advances in artificial intelligence, deep learning algorithms have emerged as effective tools for analyzing remote sensing images and extracting valuable information. Rapid advancements in deep learning have enabled a variety of object detection techniques that have recently been applied to remote sensing tasks such as defense [1], agriculture [2], resource exploration [3], surveillance [4], disaster management [5], and environmental monitoring [6].
Remote sensing object detection methods fall into two categories: two-stage and one-stage. Faster R-CNN [7] and R-FCN [8] are two-stage approaches that use a region proposal network (RPN) to identify regions of interest in an image, followed by classification and regression. Despite their excellent precision, these approaches are computationally expensive. One-stage approaches such as SSD [9] and YOLO [10], in contrast, improve inference speed by producing the final localization and classification predictions simultaneously, albeit with a minor loss of precision. Regardless of their architecture, most conventional object detection methods predict horizontal bounding boxes for the detected objects. However, owing to the complexity of remote sensing images, such as their top-down perspective and high resolution, typical horizontal object detection algorithms struggle to achieve strong results.
To improve performance in complex remote sensing scenarios, researchers have proposed object detection approaches based on rotated bounding boxes. Existing rotated object detection algorithms use predefined anchor boxes with variable scales, aspect ratios, and angles to identify arbitrarily oriented, multiscale objects. To decrease computation, some algorithms, such as the RoI-Transformer [11], learn the transformation from horizontal to rotated RoIs. Similarly, Oriented R-CNNs [12] use rotation-capable bounding boxes to adapt to varying object orientations in remote sensing imagery. Other approaches, such as CAD-Net [13] and Info-FPN [14], build robust feature representations using contextual information or multiscale representations. Further approaches refine sampling to improve small-object detection or build feature refinement modules for the correct alignment of rotated anchor boxes. Nonetheless, despite these promising improvements, a number of problems remain that limit the use of deep learning-based object detection systems in remote sensing applications.
One of the most common problems in object detection is overfitting, which occurs when models perform well on training data but poorly on unseen data. The lack of uncertainty quantification in object detection, which is critical for risk-sensitive applications, exacerbates the problem. Furthermore, standard feature extraction algorithms frequently fail to adequately capture and extract multiscale features, resulting in information loss and unsatisfactory model performance. In addition, the models’ effectiveness is hampered by a lack of high-resolution images, which are required for accurate object detection. Finally, retrieving background information for detected objects is a critical challenge in refining these models for diverse remote sensing applications.
The abovementioned issues are discussed in detail below:
1. Overfitting and uncertainty quantification
Overfitting happens when a model learns the precise characteristics and noise in the training dataset to such an extent that it significantly impairs the model’s performance on unseen, real-world data during its training phase. As a result, the model performs well on training data while struggling to sustain that level of performance on new, previously unknown data. This has severe ramifications in remote sensing because these models are frequently employed to detect crucial environmental and geographical aspects. An overfit model may miss crucial traits or mistakenly identify irrelevant ones, resulting in incorrect interpretations and conclusions. Furthermore, the absence of uncertainty quantification indicates that the model’s predictions lack a measure of confidence or error estimation. In other words, it is impossible to determine how reliable the model’s prediction is for unseen data, which is especially problematic for risk-sensitive applications where knowing the margin of error is critical. As a result, several strategies must be investigated that can not only prevent overfitting but also enable uncertainty estimation of detected objects;
2. Information loss in feature extraction
A major concern in the field of remote sensing object detection is the issue of information loss during feature extraction, which serves as the basis for effective object detection. Traditional feature extraction models, which are frequently used as the backbone networks for object detection systems, fail to acquire and preserve all of the important information from remote sensing images. This information loss can have a severe influence on object detection performance since key cues for identifying objects may be missed during feature extraction. Moreover, the feature pyramid network (FPN), a popular component of these systems, faces comparable difficulties. The FPN’s inner lateral connections, which are intended to integrate high-level semantic information with low-level specific information, also suffer from information loss. This can limit the network’s ability to handle objects of varying sizes.
Super-resolution techniques can be used to address this issue. These techniques aim to generate a higher-resolution image from low-resolution inputs, effectively enhancing the level of detail and boosting object detection accuracy. Furthermore, the use of these techniques allows for the investigation of the effect of higher resolution on the detection accuracy of object detection models. This type of study can provide useful insights into the relationship between image resolution and detection performance, driving future advancements, particularly for objects with minute or delicate characteristics. As a result, it is critical to propose a new backbone network and modify the FPN’s inner lateral connection module;
3. Remote sensing image resolution
High-resolution images are critical for accurate object detection and interpretation in remote sensing images. However, one of the most common problems is the lack of high-resolution images. Remote sensing platforms may not always be able to record and send high-resolution images due to different constraints such as sensor limitations, transmission bandwidth, or storage capacity. The use of lower-resolution images in object detection might result in the loss of fine features, lowering the performance of detection algorithms and resulting in inaccurate or missed detections.
4. Background interpretation
The implementation of remote sensing applications relies significantly not only on the detected objects but also on the context or background information surrounding them. However, existing object detection models frequently neglect the substantial information contained in the background and only prioritize the objects of interest. This lack of context can restrict the interpretability and applicability of detection results in a variety of scenarios. For instance, understanding the environmental background (urban, forest, water bodies, etc.) might have a big impact on how the identified objects (vehicles, buildings, plants, etc.) in the images are interpreted. In addition, certain applications, such as disaster management and land use planning, require comprehensive spatial data that includes both the objects of interest and their surroundings. To circumvent this limitation, background classification must be incorporated into the object detection model. This strategy would enable a more comprehensive comprehension of the detected scene by providing additional context, thereby enhancing the overall effectiveness and adaptability of remote sensing applications.
In this paper, we propose a Bayes by backpropagation (BBB)-based system for extracting detection and classification information from remote sensing images. First, we present the Bayes R-CNN object detection model, which uses BBB convolution layers to prevent overfitting and enables our model to compute epistemic and aleatoric uncertainty for each detected object. A novel backbone model called the multi-resolution extraction network (MRENet) is presented to effectively extract features from remote sensing images. In addition, the traditional FPN is modified by incorporating the multi-level feature fusion module (MLFFM) in the inner lateral connection part to improve feature extraction and by introducing a novel local–global attention module called the Bayesian distributed lightweight attention module (BDLAM) to extract both local and global features from feature maps. Before the prediction phase, the system also incorporates a Bayesian image super-resolution technique to improve image quality and system performance. In the final stage, a background information extractor based on MRENet is added to provide background information for the detected images. The abbreviations used in this study are listed in Table 1.
This paper’s primary contributions are outlined below:
  • This paper introduces the Bayes R-CNN object detection framework, designed to enhance the reliability of object detection within remote sensing imagery by mitigating overfitting and providing uncertainty estimates;
  • A novel backbone network called MRENet is proposed to replace traditional backbones, and the lateral connection of the FPN is replaced with MLFFM to preserve relevant features;
  • BDLAM is proposed in the FPN to improve the feature extraction and reduce the positional information loss during the feature extraction;
  • Bayesian super image resolution is embedded in the prediction step to generate high-resolution images from low-resolution images prior to object detection to improve the detection performance;
  • The MRENet is then applied to the predicted images to classify the background of the detected objects, providing a robust interpretation of the remote sensing scene.

2. Related Works

2.1. Deep Learning-Based Object Detectors in Remote Sensing

The effectiveness of object detection in real-world scenarios has been substantially improved by advances in deep learning, with Convolutional Neural Network (CNN)-based detectors showing particularly impressive results. Due to its high degree of precision, the two-stage detection method supported by R-CNNs has become a popular option in these contexts. It has also been seen that one-stage algorithms are remarkably effective at detection tasks and provide a higher inference rate than two-stage object detectors.
Recently, existing object detector models were applied to the state-of-the-art DIOR and HRSC2016 datasets. Li et al. first proposed the DIOR dataset, which contains high-resolution images of 20 distinct objects [15]. Huyan et al. proposed a lightweight one-stage model that utilized SNET as its backbone to extract and integrate features effectively [16]. This approach obtained 66.5% accuracy on the DIOR dataset, demonstrating its potential effectiveness for remote sensing imagery despite its lightweight nature. Huang et al. introduced a Cross-Scale Feature-Fusion Pyramid Network with a Cross-Scale Fusion Module and Thinning U-Shaped Modules [17]. With a respectable mAP of 67.25%, this method is designed to manage the challenge of negative sample proliferation during fusion. Sun et al. effectively modified the widely used YOLOv3 model to enhance the detection of small objects [18]. By adding a detection layer on top of the three existing detection layers in the YOLO model, accuracy was enhanced, and an mAP of 69.3% was attained on the DIOR dataset. Liu et al. introduced a one-stage object detection framework that incorporates a feature pyramid network and achieved 69.49% accuracy when evaluated against the DIOR dataset [19]. Yuan et al. introduced the MFPNet, a model developed to address the challenges of scale variance and the dispersion of objects in remote sensing images, and attained an mAP of 71.2% [20]. Xu et al. developed a feature-aligned single-shot detector for remote sensing images with an mAP of 71.1% [21]. Their methods provided innovative solutions, albeit with a few limitations, particularly in the handling of oriented objects in remote sensing images. Jiaqi et al. developed an end-to-end solution to improve the performance on the DIOR dataset and acquired an mAP of 71.8% [22]. Wang et al. introduced a comprehensive network for object detection, which integrates a multiscale enhancement network and achieved 71.8% accuracy [23]. Cheng et al. obtained an impressive mAP of 73.3% by incorporating an Aware Feature Pyramid Network and a Group Assignment Strategy into their novel framework [24]. This framework showed promise for enhancing feature pyramid interactions and label assignment. Dong et al. achieved 73.6% accuracy by enhancing the FPN through the incorporation of a deformable convolutional module in the network [25]. In addition, they added Attention-based Multi-Level Feature Fusion Modules to the FPN, which substantially improved the model’s accuracy.
Several studies have been conducted to improve the efficacy of object detection on the HRSC2016 dataset. Xiao et al. proposed a one-stage, anchor-free method for detecting oriented objects that minimizes computational complexity by using per-pixel prediction [26]. This method predicted the object’s axis and width and utilized an aspect-ratio-aware orientation centeredness to enhance detection performance, achieving an mAP of 78.51%. Chen et al. introduced a novel method known as Pixels-IoU Loss for detecting oriented bounding box objects [27]. It considered both angle and intersection-over-union (IoU) to improve oriented bounding box regression, resulting in an mAP of 89.20% and substantially improved performance on objects with high aspect ratios and complex backgrounds. Using a similar theme, Xiao et al. proposed a feature decoupling and localization refinement network that effectively handled scale variability and arbitrary orientation challenges, obtaining an mAP of 89.4% [28]. In addition, Lang et al. developed a Dense One-stage Anchor-Free Deep Network with a simpler, more easily optimized architecture than two-stage models [29]. It predicted bounding boxes on a dense grid over the input image and obtained an mAP of 89.76%, which is comparable to existing object detectors. Lyu et al. presented an efficient real-time object detector, RTMDet, that outperformed the YOLO series in terms of performance [30]. It possessed a balanced backbone and neck architecture based on large-kernel depth-wise convolutions and utilized soft labels for enhanced dynamic label assignment precision, achieving an mAP of 90.60%. Wang et al. conducted an empirical investigation of Remote Sensing Pretraining using MillionAID, the largest remote sensing (RS) scene recognition dataset [31]. With an impressive mAP of 90.4%, they generated RS-pretrained backbones from diverse network architectures. Li et al. presented the Large Selective Kernel Network for remote sensing object detection, which dynamically adjusted its large spatial receptive field to better model the contexts of various objects [32]. Utilizing unique prior knowledge prevalent in remote sensing scenarios, they attained the highest mAP of 90.65% on the HRSC2016 dataset.
The aforementioned solutions for object detection in remote sensing images have made significant progress, but there are still challenges that our proposed Bayes R-CNN object detection model and associated techniques can address. Some existing models, for instance, have issues with robustness and overfitting, which may limit their performance and generalization to unknown and open set data. By reducing overfitting and providing uncertainty estimates, the Bayes R-CNN model increases the robustness of object detection in remote sensing images. Consequently, Bayes R-CNN can result in more dependable and robust detection performance on remote sensing images.

2.2. Multiscale Feature Extraction

The extraction of multiscale and resolution-based characteristics from images is required for remote sensing object identification. Several strategies have been developed to extract complicated characteristics from remote sensing images while minimizing information loss. Chen et al. developed a pixel shuffle-based lateral connection module (PSM) to preserve channel information during feature extraction [14]. The PSM is capable of successfully fusing features from different scales and channels, resulting in increased detection performance. Li et al. introduced a novel rotation detector for remote sensing images, which uses a mask branch to predict form masks and, as a result, generate rotation bounding boxes of objects [33]. An enhanced feature pyramid network is used to manage size disparities, and a mix of position and channel attention networks is used to improve tiny object detection in complicated scenes. By integrating detail and semantic characteristics via an extra Multiscale Feature Fusion Layer, Zhang et al. presented a Multiscale Feature Fusion Network based on a CNN to improve object detection in remote sensing images [34]. Ghiasi et al. developed a strategy for improving the learning of scalable feature pyramid architectures by including all cross-scale connections [35]. Zhang et al. focused on different types of objects for enhanced information extraction using a spatial-and-scale-aware attention model [13]. Guo et al. used consistent supervision, residual feature augmentation, and soft RoI selection to overcome information loss and semantic gaps in the FPN [36]. Tan et al. improved feature representation with BiFPN by using residual connections and weighted fusion to provide weighted attention to distinct feature levels [37]. Vo et al. created a feature aggregation and refinement module to solve the imbalance between high-level and low-level features [38]. They proposed an offset adaptation module to harmonize multi-level feature information in response to the substantial size differences in remote sensing objects caused by varied viewpoints and altitudes. Liu et al. and Shi et al. improved feature hierarchies in remote sensing images and captured object symmetry, respectively [39,40]. Luo et al. used PixelShuffle to reduce channel information loss [41], while Liu et al. presented a gated ladder-shaped FPN for feature fusion [42].

3. Preliminaries

3.1. Bayesian Convolutional Neural Network

Conventional convolutional neural networks suffer from overfitting and overconfident predictions when trained on limited data. Moreover, conventional neural networks do not provide uncertainty estimates for their predictions, which are crucial to remote sensing image scene understanding. Cascading a conventional convolutional neural network with a Bayesian neural network can overcome these issues. A Bayesian convolutional neural network approximates the posterior probability distribution $p(w|D)$ over the weights using a variational distribution $q_\theta(w|D)$. The variational distribution is composed of Gaussian distributions $\mathcal{N}(\theta|\mu, \sigma^2)$, where $\mu \in \mathbb{R}^d$ and $\sigma \in \mathbb{R}^d$. The uncertainty of a model can be calculated using these Gaussian variational posterior distributions in a Bayesian convolutional neural network. This study implemented Bayes by backpropagation with variational inference on the standard convolutional neural network to improve the performance of the Bayes R-CNN [43].

3.2. Variational Inference

A predictive function for a given input can be defined as $y = f(x)$, where $x$ is the input data and $y$ is the output. Bayesian inference places a prior distribution over this predictive function and infers the posterior distribution $p(f|X, Y)$ for a given dataset. The posterior predictive distribution for unseen data $x^*$ can then be defined as follows:
$$p(y^*|x^*, X, Y) = \int p(y^*|f^*)\, p(f^*|x^*, X, Y)\, df^*.$$
Although the above equation is intractable, it can be approximated through the random variables w using the following equation:
$$p(y^*|x^*, X, Y) = \int \int p(y^*|f^*)\, p(f^*|x^*, w)\, p(w|X, Y)\, df^*\, dw.$$
The variational distribution $q(w)$ is then used to approximate $p(w|X, Y)$. However, the approximated distribution needs to be similar to the original posterior distribution. Therefore, the Kullback–Leibler divergence is minimized to approximate the predictive distribution as follows:
$$q(y^*|x^*) = \int \int p(y^*|f^*)\, p(f^*|x^*, w)\, q(w)\, df^*\, dw.$$
The Kullback–Leibler divergence for variational inference can be defined as follows:
$$\mathrm{KL}_{VI} = -\int q_\theta(w)\, p(F|X, w)\, \log p(Y|F)\, dF\, dw + \mathrm{KL}\big(q(w)\,\|\,p(w)\big).$$
Minimizing the above objective is equivalent to maximizing the log evidence lower bound, which provides a good representation of the data and reduces overfitting.

3.3. Bayes by Backpropagation with Variational Inference

Bayes by backpropagation is used to learn the posterior distribution over the weights using the backpropagation method. As defined earlier, Kullback–Leibler divergence is used to approximate the variable posterior distribution, which can be later used to get optimal parameters of a model as follows:
$$\theta^{opt} = \arg\min_\theta \mathrm{KL}\big[q_\theta(w|D)\,\|\,p(w|D)\big] = \arg\min_\theta \big( \mathrm{KL}\big[q_\theta(w|D)\,\|\,p(w)\big] - \mathbb{E}_{q(w|\theta)}\big[\log p(D|w)\big] + \log p(D) \big),$$
where KL can be defined as follows:
$$\mathrm{KL}\big[q_\theta(w|D)\,\|\,p(w)\big] = \int q_\theta(w|D)\, \log\frac{q_\theta(w|D)}{p(w)}\, dw.$$
The tractable cost function is then defined, which is minimized during the training phase of a Bayesian model,
$$\mathcal{F}(D, \theta) \approx \sum_{i=1}^{n} \big[ \log q_\theta(w^{(i)}|D) - \log p(w^{(i)}) - \log p(D|w^{(i)}) \big],$$
where $w^{(i)}$ denotes the $i$-th Monte Carlo sample of the weights, and the variational posterior $\log q_\theta(w^{(i)}|D)$ and the log prior can be defined as follows:
$$\log q_\theta(w^{(i)}|D) = \sum_i \log \mathcal{N}(w_i\,|\,\mu, \sigma^2),$$
$$\log p(w^{(i)}) = \sum_i \log \mathcal{N}(w_i\,|\,0, \sigma^2).$$
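For illustration, the sketch below shows one way the cost function above can be realized in a Bayesian convolution layer with PyTorch, assuming a Gaussian variational posterior parameterized by (μ, ρ) with σ = softplus(ρ). The prior scale, initialization values, and the `elbo_loss` helper are assumptions made for illustration, not the exact implementation used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

class BBBConv2d(nn.Module):
    """Bayes-by-backprop convolution: a Gaussian variational posterior over the
    weights, sampled with the reparameterization trick on every forward pass."""
    def __init__(self, in_ch, out_ch, kernel_size, prior_sigma=0.1, padding=0):
        super().__init__()
        shape = (out_ch, in_ch, kernel_size, kernel_size)
        self.mu = nn.Parameter(torch.empty(shape).normal_(0.0, 0.05))
        self.rho = nn.Parameter(torch.full(shape, -4.0))  # softplus(-4) ~ 0.018
        self.prior = Normal(0.0, prior_sigma)
        self.padding = padding
        self.log_q = None  # log q_theta(w | D) of the sampled weight
        self.log_p = None  # log p(w) of the sampled weight

    def forward(self, x):
        sigma = F.softplus(self.rho)
        w = self.mu + sigma * torch.randn_like(sigma)      # reparameterized sample
        self.log_q = Normal(self.mu, sigma).log_prob(w).sum()
        self.log_p = self.prior.log_prob(w).sum()
        return F.conv2d(x, w, padding=self.padding)

def elbo_loss(model, nll):
    """One-sample Monte Carlo estimate of F(D, theta):
    log q(w|D) - log p(w) - log p(D|w), where nll = -log p(D|w)."""
    kl_term = sum(m.log_q - m.log_p for m in model.modules()
                  if isinstance(m, BBBConv2d))
    return nll + kl_term
```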

3.4. Uncertainty Estimation in Bayesian CNN

Bayesian CNNs, unlike conventional CNNs, can provide uncertainty estimation for both the model and the data. This uncertainty can be categorized into two types: aleatoric and epistemic uncertainty. Aleatoric uncertainty, also known as data uncertainty, arises from the inherent noise in the data themselves and cannot be reduced by collecting more data. Epistemic uncertainty, or model uncertainty, is due to limited data and can be reduced as more data become available. Formally, uncertainty in Bayesian CNNs is estimated using the unbiased prediction distribution, defined as follows:
$$\mathbb{E}_{q}\big[p_D(y^*|x^*)\big] = \int q_\theta(w|D)\, p_w(y^*|x^*)\, dw \approx \frac{1}{T}\sum_{t=1}^{T} p_{w_t}(y^*|x^*),$$
where $p_{w_t}(y^*|x^*)$ represents the predictive distribution obtained from the model parameters $w_t$ sampled from the posterior distribution $q_\theta(w|D)$.
The variance of this predictive distribution is used to calculate aleatoric and epistemic uncertainty as follows:
$$\mathrm{Var}_q\big[p(y^*|x^*)\big] = \frac{1}{T}\sum_{t=1}^{T} \big( \mathrm{diag}(\hat{p}_t) - \hat{p}_t \hat{p}_t^{\top} \big) + \frac{1}{T}\sum_{t=1}^{T} \big(\hat{p}_t - \bar{p}\big)\big(\hat{p}_t - \bar{p}\big)^{\top},$$
where $\bar{p} = \frac{1}{T}\sum_{t=1}^{T}\hat{p}_t$ and $\hat{p}_t = \mathrm{Softmax}\big(f_{w_t}(x^*)\big)$.
Aleatoric uncertainty is captured by the term $\frac{1}{T}\sum_{t=1}^{T} \big( \mathrm{diag}(\hat{p}_t) - \hat{p}_t \hat{p}_t^{\top} \big)$, which reflects the inherent noise in the data. Epistemic uncertainty, on the other hand, is represented by the term $\frac{1}{T}\sum_{t=1}^{T} \big(\hat{p}_t - \bar{p}\big)\big(\hat{p}_t - \bar{p}\big)^{\top}$, which accounts for the uncertainty in the model parameters.
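The two uncertainty terms can be estimated directly from T stochastic forward passes, as in the short sketch below; the classifier `model` is a placeholder that is assumed to resample its weights on every call, and only the diagonal entries of the two matrices are computed.

```python
import torch

@torch.no_grad()
def predictive_uncertainty(model, x, T=20):
    """Monte Carlo estimate of the predictive mean and of the (diagonal)
    aleatoric and epistemic uncertainty terms defined above."""
    probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(T)])  # (T, B, K)
    p_bar = probs.mean(dim=0)                                                 # (B, K)
    # Aleatoric: mean over t of diag(p_t) - p_t p_t^T (diagonal entries only).
    aleatoric = (probs - probs ** 2).mean(dim=0)
    # Epistemic: mean over t of (p_t - p_bar)(p_t - p_bar)^T (diagonal entries only).
    epistemic = ((probs - p_bar) ** 2).mean(dim=0)
    return p_bar, aleatoric.sum(dim=-1), epistemic.sum(dim=-1)
```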

4. Uncertainty-Aware Object Detection Method

The proposed remote sensing scene understanding system’s overall structure is shown in Figure 1. Our proposed system has been trained and assessed on remote sensing datasets, including DIOR and HRSC2016. These datasets feature images of diverse objects set against complex backgrounds. The detection dataset is split into training and validation data prior to training the Bayes R-CNN object detection model. The same dataset is also used to construct a new background information extraction dataset for distinguishing background classes. The classification dataset is subsequently passed through the data augmentation procedure, which improves the dataset’s quality. Auto augmentation is utilized to automatically enhance the data depending on image characteristics. In order to analyze the prediction performance of the proposed system after implementing the image super-resolution technique, a Bayesian image super-resolution method is added to the prediction part of the object detection system. The uncertainty estimation (epistemic and aleatoric) is calculated after detecting objects in each image.

4.1. Bayes R-CNN

The Bayes R-CNN adheres to the structure of two-stage object detection models. In this framework, Bayesian inference is applied to the FPN, RPN, and the final classification layer to capture uncertainty and mitigate overfitting. An innovation within this model is that MLFFM replaces the traditional lateral connection of the FPN, enhancing its feature representation capabilities. Further, we introduce BDLAM after the FPN to extract more insightful features to be processed by the RPN. The standard convolutional layer in the RPN is replaced by a Bayesian Convolutional Neural Network, enabling a Gaussian distribution over the weights. The Bayes R-CNN is constituted of a feature extractor with an FPN, an RPN, a multiscale ROI-aligned module, and a fully connected layer for final prediction. Each of these components will be explained in the following subsections.

4.2. Feature Extraction with Feature Pyramid Network

4.2.1. MRENet

Available feature extractors for remote sensing images face challenges in recognizing tiny objects against sophisticated backgrounds. Existing CNN models are designed to capture and process low-level features of input images before feeding them into an FPN. However, these models often fail to effectively capture and utilize fine-grained details and high-level contextual information simultaneously, which is crucial for remote sensing applications. Therefore, we propose a novel CNN model, MRENet, to replace common backbone networks.
Figure 2 shows the overall architecture of the MRENet, which consists of stem, body, and block parts. The body includes four stages, and every stage consists of a different number of blocks. The first and second stages each comprise three blocks, the third stage is designed with nine blocks, and the final stage consists of three blocks. Every block consists of several convolution layers and an attention network.
The detection task of remote sensing images requires sophisticated models to capture fine-grained information from the input images. The initial layer of a deep learning model called the stem is very crucial in extracting useful features from input image prior to passing feature maps to the body part of the deep learning model. Moreover, the stem module contributes highly to the overall performance of the backbone network; therefore, the design of the stem plays an important role in the ability of the backbone network to handle complex object detection tasks. Traditional stem modules are composed of a sequence of convolution layers with a fixed depth, but these modules do not fully leverage the unique characteristics of remote sensing images. Therefore, we have proposed a novel dynamic hierarchical feature learning (DHFL) stem module for remote sensing object detection tasks.
Figure 3 shows the overall structure of the proposed DHFL stem module, which is composed of four main parts: multi-resolution processing, depth prediction, stem modules, and feature map fusion. In contrast, the base method utilizes a simpler structure with a single convolutional layer, batch normalization, and ReLU activation. The hierarchical design of the DHFL stem module processes input data at multiple scales to produce a coherent output feature map, providing a more intricate and adaptive approach compared to the base method. In this study, we use an input image $X$ of dimensions $224 \times 224 \times C$ and generate $M$ downsampled images at four different scales, $X_1 \in \mathbb{R}^{224 \times 224 \times C}$, $X_2 \in \mathbb{R}^{112 \times 112 \times C}$, $X_3 \in \mathbb{R}^{56 \times 56 \times C}$, and $X_4 \in \mathbb{R}^{28 \times 28 \times C}$, with scaling factors of 1, 2, 4, and 8, respectively. A detailed description of each part of the DHFL stem module is provided below, followed by an illustrative code sketch:
  • Multi-resolution Input Processing: The multi-resolution input processing part takes the input image $X$ of dimensions $H \times W \times C$ and generates $M$ downsampled images at various scales. These downsampled images are denoted as $X_1, X_2, \ldots, X_M$, where each image has dimensions $H_i \times W_i \times C$. This part enables the model to capture information from the input data at various resolutions to extract both fine-grained details and high-level contextual information;
  • Depth Prediction: The depth prediction network, $P_i$, is employed to predict the optimal depth $K_i$ of the stem module for every scale. The depth prediction part takes the downsampled input data $X_i$ as input and outputs the predicted depth $K_i$, which is an integer value ranging from the minimum to the maximum allowable depth ($k_{min}$ to $k_{max}$). The proposed method allows the stem module to better handle the complexity of the input data at each scale;
  • Stem Modules: The stem module, $F_i^{K_i}$, processes the downsampled input data $X_i$ using the dynamically adjusted depth $K_i$. The output of each stem module is a feature map $Y_i$ of dimensions $H_i \times W_i \times C'$. The output feature maps from the stem modules contain scale-specific information that has been adaptively processed;
  • Feature Map Fusion: The output feature maps $Y_1, Y_2, \ldots, Y_M$ from each scale’s stem module are resized to a specific spatial resolution $H \times W$. The resized feature maps, $Z_1, Z_2, \ldots, Z_M$, are then concatenated along the channel axis, resulting in a combined feature map $Z$ of dimensions $H \times W \times MC$. A convolutional layer, $G$, processes the concatenated feature map $Z$ and generates the final output feature map $Z_{out}$ with dimensions $H \times W \times C$. The fusion part integrates the information from different resolutions into a single feature map that is fed to the body of the proposed model for further processing.
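The sketch below illustrates the four parts described above in PyTorch. The depth-predictor architecture, channel widths, and the use of a hard argmax for depth selection are illustrative assumptions; a differentiable selection (e.g., Gumbel-softmax) would be needed to train the depth predictor end to end.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DHFLStem(nn.Module):
    """Illustrative DHFL stem: multi-resolution inputs, per-scale depth
    prediction, variable-depth stem convolutions, and feature map fusion."""
    def __init__(self, in_ch=3, out_ch=64, scales=(1, 2, 4, 8), k_min=1, k_max=4):
        super().__init__()
        self.scales, self.k_min = scales, k_min
        # One small depth-prediction head per scale (global pooling -> linear).
        self.depth_heads = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(in_ch, k_max - k_min + 1))
            for _ in scales)
        # A pool of k_max stem blocks per scale; the predicted depth K_i
        # decides how many of them are applied.
        self.stems = nn.ModuleList(
            nn.ModuleList(
                nn.Sequential(nn.Conv2d(in_ch if d == 0 else out_ch, out_ch, 3, padding=1),
                              nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
                for d in range(k_max))
            for _ in scales)
        self.fuse = nn.Conv2d(out_ch * len(scales), out_ch, 1)  # convolution G

    def forward(self, x):
        h, w = x.shape[-2:]
        outs = []
        for i, s in enumerate(self.scales):
            xi = F.avg_pool2d(x, s) if s > 1 else x               # multi-resolution input X_i
            k = int(self.k_min + self.depth_heads[i](xi).argmax(-1).max())  # predicted depth K_i
            yi = xi
            for d in range(k):                                    # stem module F_i^{K_i}
                yi = self.stems[i][d](yi)
            outs.append(F.interpolate(yi, size=(h, w), mode="bilinear",
                                      align_corners=False))       # resize to H x W
        return self.fuse(torch.cat(outs, dim=1))                  # feature map fusion
```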
Figure 4 offers a side-by-side visual comparison between the proposed DHFL stem module and the base stem module. The representation without the DHFL stem module shows evident gaps in feature extraction from certain intricate regions of the image. Moreover, false positive and false negative feature extraction was also observed in the base stem module. On the other hand, the image processed with the DHFL stem module vividly captures and emphasizes all intricate details, showcasing the module’s effectiveness in complex scene images. This visual evidence underscores the pivotal role of the DHFL stem module in boosting the model’s accuracy in detecting and processing nuanced features within intricate visual contexts.
Figure 5 represents the structure of an individual block of the proposed CNN backbone model. The block is divided into a channel–spatial attention submodule and a Multilayer Perceptron (MLP) submodule. The input is initially passed through a 1 × 1 convolution layer with a stride of 1. The output is then passed through a 3 × 3 convolution layer with the same stride. The output is then fed into a novel channel–spatial attention module (CSAM) to effectively extract features along the channel and spatial dimensions. The output is then passed through a 1 × 1 convolution layer for further refinement. The resulting feature maps are concatenated with the input feature maps and processed further by convolution layers with different kernel sizes (1 × 1, 3 × 3, 5 × 5, and 7 × 7) to extract additional feature information. The outputs are then concatenated again, followed by a 1 × 1 convolution layer. The output of this convolution layer is passed through an MLP layer to extract global feature information. The output of the MLP layer is then concatenated with the output of the channel–spatial attention submodule. The result is fed into the remaining parts of the model for further processing.
Figure 6 represents the structure of the CSAM, which is added to every block of the proposed model. The CSAM can be divided into two submodules: a channel attention submodule and a spatial attention submodule. The feature maps from the previous layer are first passed to the CSAM. The channel attention submodule first performs global average pooling on the feature maps, followed by a convolution layer with a 3 × 3 filter size. The output of this convolution layer is passed through an MLP layer to extract more valuable features along the channel dimension. The output of the MLP layer is then passed through a 3 × 3 convolution layer followed by a sigmoid function. The output of the sigmoid function is then multiplied with the initial input features.
$$CA = \sigma(\mathrm{Conv}_3(\mathrm{MLP}(\mathrm{Conv}_3(\mathrm{AvgPool}(X))))).$$
The spatial attention submodule first performs Maxpooling on the input features followed by a 1 × 1 convolution layer. The output from the convolution layer is then passed through a sigmoid function. The output from the sigmoid function is then multiplied by the initial input feature maps to acquire spatial wise attention.
$$SA = \sigma(\mathrm{Conv}_1(\mathrm{MaxPool}(X))).$$
The output from the channel and spatial wise attention then undergoes element wise multiplication to acquire the final output of the CSAM.
$$CSAM = CA \otimes SA.$$
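A minimal sketch of the CSAM following the description and equations above is given below; the MLP reduction ratio and the use of channel-wise max pooling for the spatial branch are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class CSAM(nn.Module):
    """Channel-spatial attention sketch.
    Channel branch: GAP -> 3x3 conv -> MLP -> 3x3 conv -> sigmoid -> scales the input.
    Spatial branch: channel-wise max pooling -> 1x1 conv -> sigmoid -> scales the input.
    The two attended feature maps are combined element-wise."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.ca_conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.ca_mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))
        self.ca_conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.sa_conv = nn.Conv2d(1, 1, 1)

    def forward(self, x):
        # CA = sigmoid(Conv3(MLP(Conv3(AvgPool(X)))))
        ca = torch.sigmoid(self.ca_conv2(self.ca_mlp(self.ca_conv1(
            x.mean(dim=(2, 3), keepdim=True)))))
        # SA = sigmoid(Conv1(MaxPool(X))), pooling over the channel axis
        sa = torch.sigmoid(self.sa_conv(x.max(dim=1, keepdim=True).values))
        # Element-wise combination of the channel- and spatial-attended features
        return (ca * x) * (sa * x)
```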

4.2.2. Bayesian Feature Pyramid Network

The FPN plays a crucial role in drawing features from input images, enhancing the feature extractor’s ability to generate comprehensive feature maps. However, detectors based on the traditional FPN architecture often underperform in remote sensing environments because of the intricate nature of the objects, their backgrounds, and variable scales. To address this, our study integrates the Multi-Level Feature Fusion Module (MLFFM) into the lateral connections of the FPN, aiming to minimize information loss. Additionally, we have developed the Bayesian Distributed Lightweight Attention Module (BDLAM), which is capable of capturing essential local and global features from the FPN’s outputs. Figure 7 shows the overall structure of the Bayesian FPN, including the MLFFM and BDLAM. The feature maps acquired from the MRENet are split into three groups, and Bayesian convolution operations with different kernel sizes are performed on each group to extract features at different hierarchies. Our approach captures spatial hierarchies effectively and allows the FPN to comprehend complex data relationships across different scales. It expands the receptive field, which facilitates a broader understanding of larger input data segments. Furthermore, the multi-kernel design reduces the parameter count compared to a single large kernel, which increases efficiency and reduces overfitting risks. By learning diverse and complementary features through different kernels, the FPN is able to learn features across different scales and spatial transformations. Pyramidal feature maps are obtained by adding the proposed MLFFM to the inner lateral connection of the FPN. The generated features are then integrated with the proposed attention module to produce enhanced feature maps.
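A simplified sketch of the multi-kernel split-and-fuse idea behind the MLFFM lateral connection is given below. The specific kernel sizes (1, 3, 5) and the plain `nn.Conv2d` layers standing in for Bayesian convolutions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MLFFMSketch(nn.Module):
    """Split the backbone feature map along channels and convolve each chunk
    with a different kernel size before fusing, as in the FPN lateral connection."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        chunk = in_ch // 3
        self.splits = [chunk, chunk, in_ch - 2 * chunk]
        self.branches = nn.ModuleList([
            nn.Conv2d(self.splits[0], out_ch, 1),
            nn.Conv2d(self.splits[1], out_ch, 3, padding=1),
            nn.Conv2d(self.splits[2], out_ch, 5, padding=2),
        ])
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, 1)

    def forward(self, x):
        parts = torch.split(x, self.splits, dim=1)
        feats = [branch(p) for branch, p in zip(self.branches, parts)]
        return self.fuse(torch.cat(feats, dim=1))
```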
Figure 8 shows the schematic structure and comparison of the proposed BDLAM with other state-of-the-art attention modules used in computer vision tasks. Figure 8a represents the structure of the well-known SE attention module, which is composed of two operations, squeeze and excitation [44]. The squeeze operation, $z_c = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_c(i, j)$, is mainly responsible for global information embedding. The excitation operation, $\hat{X} = X \cdot \sigma(\hat{z})$, captures channel-wise dependencies. However, the SE attention module only concentrates on channel relationships, inadvertently neglecting crucial positional information. Later, FcaNet was proposed to improve the global information embedding by introducing the discrete cosine transform (DCT) in place of the traditional global average pooling method, as represented in Figure 8b [45]. Although FcaNet, $s = F_{fca}(X, \theta) = \sigma(W_2\,\delta(W_1\,\mathrm{DCT}_{\mathrm{Group}}(X)))$, shows great improvement compared to the traditional global average pooling method, it has limitations in handling the unique positional data and complex backgrounds found in remote sensing images. Coordinate attention, which can be seen in Figure 8c, tries to solve the problem found in traditional SE-based attention modules by preserving positional information through a pair of 1D feature-encoding operations, $z_c^h(h) = \frac{1}{W}\sum_{i=0}^{W-1} x_c(h, i)$ and $z_c^w(w) = \frac{1}{H}\sum_{j=0}^{H-1} x_c(j, w)$ [46]. While this approach enables the capture of long-range spatial interactions with precise positional information, it has limitations in accurately localizing objects of interest within complex backgrounds and intricate spatial structures.
The proposed BDLAM solves these problems by introducing local and global attention blocks to process rich feature maps simultaneously, thus ensuring detailed feature extraction without feature loss and capturing broader patterns and orientations. The introduction of the involution layer in place of the convolution layer also reduces computation costs and improves training stability. Moreover, the Discrete Fourier Transform (DFT) allows for the preservation of more essential patterns from both horizontal and vertical coordinates because of the access to both higher and lower frequencies of the input, which is very crucial for remote sensing images.
The BDLAM uses a local attention block to refine the regional characteristics of remote sensing images. This process begins with a 1 × 1 involution layer that simplifies the input, which is then enhanced through batch normalization and the LeakyReLU activation function to ensure computational efficiency and training stability without loss of features. Further processing includes a 3 × 3 involution layer and additional batch normalization and activation, followed by dimension reduction back to the original size through another 1 × 1 involution. Finally, after a last round of batch normalization and a fully connected layer, the refined features are relayed to the global attention block via a sigmoid function. This structure effectively balances computational demands with feature preservation and enhancement.
$$s_1 = F_{FM_{local}}(X, \theta) = \sigma(\mathrm{BInv}_3(\delta(\mathrm{BN}(\mathrm{BInv}_2(\delta(\mathrm{BN}(\mathrm{BInv}_1(X)))))))).$$
The global attention block of the BDLAM enhances feature detection by compensating for the limitations of the local attention block, particularly in processing remote sensing images with complex backgrounds and varied object orientations. It starts by splitting the input features into horizontal and vertical coordinates, which are processed through a two-dimensional DFT to retain essential patterns and compress information effectively. The results are refined through involution layers and combined. Finally, after batch normalization and activation using LeakyReLU, the features are segmented and subjected to sigmoid activation before being integrated with the local attention block outputs for enriched feature representation.
$$z^h = \mathrm{BInv}_1^{1\times1}(\mathrm{DFT}_h(X)), \quad z^w = \mathrm{BInv}_2^{1\times1}(\mathrm{DFT}_w(X)),$$
$$f = \delta(\mathrm{BN}(\mathrm{BInv}_3^{1\times1}([z^h; z^w]))), \quad f^h, f^w = \mathrm{Split}(f),$$
$$s^h = \sigma(\mathrm{BInv}_4^{1\times1}(f^h)), \quad s^w = \sigma(\mathrm{BInv}_5^{1\times1}(f^w)),$$
$$s_2 = F_{FM_{global}}(X, \theta) = (s^h + z^h) \times (s^w + z^w).$$
The relevant aggregation equation can be seen as follows:
$$F_{M_{output}} = s_1 + s_2,$$
where $s_1 = F_{FM_{local}}$ is the output from the local attention block and $s_2 = F_{FM_{global}}$ is the output from the global attention block.

4.3. Region Proposal Network

The features obtained from the feature extractor are processed by the region proposal network, which employs end-to-end training to identify potential objects and their corresponding bounding boxes in an image. This network traditionally utilizes standard convolutional layers to differentiate between objects and the background and to conduct bounding box regression. In a modification to this approach, both the classification and regression functionalities of the RPN have been substituted with Bayesian convolutional layers, as depicted in Figure 9.

4.4. Final Prediction Layer

The final prediction layer is responsible for object classification and the final box regression proposal using a fully connected layer. The original fully connected layer is replaced with a Bayesian linear layer to enable Bayesian inference in the model. Normally, cross-entropy loss is used in the prediction layer for object classification, but the proposed Bayesian inference requires an evidence lower bound (ELBO) as the loss function. Moreover, it has been found that focal loss performs better in complex object detection scenarios. Therefore, the focal loss function and the ELBO are integrated to calculate the model’s object classification loss. The focal loss function can be defined as follows:
$$FL(p_t) = -(1 - p_t)^{\gamma} \log(p_t),$$
where γ is the tunable focusing factor. The ELBO function for the proposed model’s classification can be defined as follows:
$$ELBO = FL(p_t) + \beta \times KL,$$
where $\beta$ is the Lagrangian multiplier, $FL(p_t)$ is the focal loss, and $KL$ is the Kullback–Leibler divergence. $\beta$ is calculated using the Soenderby approach as follows:
$$\beta = \min\big(\mathrm{epoch} / (\mathrm{num\_epoch}/4),\, 1\big).$$
The final loss function of the Bayes R-CNN can be defined as follows:
$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}}\sum_i ELBO + \lambda \frac{1}{N_{reg}}\sum_i p_i^{*}\, L_{reg}(t_i, t_i^{*}).$$
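The classification loss above can be assembled as in the following sketch; the focusing parameter value and the way the KL term is collected from the Bayesian layers are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """FL(p_t) = -(1 - p_t)^gamma * log(p_t), averaged over proposals."""
    log_pt = F.log_softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-(1.0 - pt) ** gamma * log_pt).mean()

def beta_schedule(epoch, num_epochs):
    """Soenderby-style warm-up: beta = min(epoch / (num_epochs / 4), 1)."""
    return min(epoch / (num_epochs / 4.0), 1.0)

def classification_elbo(logits, targets, kl, epoch, num_epochs, gamma=2.0):
    """ELBO-style classification term: focal loss + beta * KL(q || p),
    where `kl` is the summed KL contribution of the Bayesian layers."""
    return focal_loss(logits, targets, gamma) + beta_schedule(epoch, num_epochs) * kl
```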

5. Image Super-Resolution Method

This study introduces a modified lightweight enhanced super-resolution CNN (LESRCNN) to improve the quality of remote sensing images. Figure 10 shows the overall structure of the Bayesian LESRCNN based on LESRCNN [47], which consists of three blocks: the information extraction and enhancement block (IEEB), the reconstruction block (RB), and the information refinement block (IRB). The CSAM attention module is added to the IEEB to extract more important features and improve information extraction from low-resolution images. Moreover, the original convolution layers are replaced with Bayesian convolution layers. The RB is responsible for extracting global and local features and transforming them into high-frequency features. This is accomplished through residual learning and the sub-pixel convolution technique. The features from the RB are then passed through the IRB to acquire and learn super-resolution features from the low-resolution features. The IRB is also responsible for constructing the super-resolution images as the output. The super-resolution procedure, according to [47], can be formulated as below:
$$O_{SR} = f_{IRB}(f_{RB}(f_{IEEB}(I_{LR}))).$$
The procedure of the RB can be formulated as below:
$$O_{RB} = R(S(O_1) + S(O_{17})),$$
where $S$ represents the sub-pixel convolution operation and $O_1$ and $O_{17}$ represent the global and local features of the low-resolution images. The IRB process can be formulated as below:
$$O_{SR} = BC_3(R(BC_3(R(BC_3(R(BC_3(R(BC_3(O_{RB}))))))))),$$
where $BC_3$ is the 3 × 3 Bayesian convolution layer, $R$ is the ReLU activation function, and $O_{RB}$ is the output from the RB.
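The IRB chain above can be read as five 3 × 3 convolutions interleaved with ReLU activations, as in the brief sketch below; plain `nn.Conv2d` layers stand in for the Bayesian convolutions, and the three-channel output is an assumption for an RGB super-resolved image.

```python
import torch.nn as nn

class IRBSketch(nn.Module):
    """Information refinement block sketch: a stack of five 3x3 convolutions
    with ReLU between them, mirroring O_SR = BC3(R(BC3(R(...BC3(O_RB)...))))."""
    def __init__(self, channels, out_channels=3, n_layers=5):
        super().__init__()
        layers = []
        for _ in range(n_layers - 1):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(channels, out_channels, 3, padding=1))
        self.body = nn.Sequential(*layers)

    def forward(self, o_rb):
        return self.body(o_rb)
```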

6. Experimental Results and Analysis

This section explores different experimental methods and metrics to demonstrate the robustness of the proposed Bayesian object detection model. We first describe the experimental platform and relevant parameters needed to reproduce the results presented in this section. The evaluation metrics commonly used for detection are then described. The training and testing analysis of the proposed model is also presented. Next, the predictions are evaluated through uncertainty estimation to demonstrate the uncertainty estimation capability of the proposed model. Lastly, the object detection predictions and background retrieval information are presented to demonstrate the future potential of Bayesian object detection.

6.1. Experimental Platforms and Parameter

The training and testing analysis was performed on the Ubuntu operating system with an AMD Ryzen 5 3500X processor and an NVIDIA RTX 3080 GPU. Python 3.7 with the PyTorch deep learning library was used to run the relevant code.

6.2. Evaluation Metrics

In this study, metrics such as precision, recall, and mean Average Precision (mAP) were applied to the detection tasks, while accuracy, recall, precision, and F1-score were used for background classification. Validation and testing involve analyzing true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The relevant equations are given below:
$$mAP = \frac{\sum_{i=1}^{n} AP_i}{n}, \quad \mathrm{Precision}\ (P) = \frac{TP}{TP + FP}, \quad \mathrm{Recall}\ (R) = \frac{TP}{TP + FN},$$
$$\mathrm{Specificity}\ (S) = \frac{TN}{TN + FP}, \quad F1\ \mathrm{score}\ (F) = \frac{2 \times R \times P}{R + P}, \quad \mathrm{Accuracy}\ (A) = \frac{TP + TN}{TP + TN + FP + FN}.$$
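For completeness, the scalar metrics above can be computed from confusion counts as in the small helper below; mAP itself additionally requires per-class precision–recall curves and is therefore not shown.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, specificity, F1-score, and accuracy from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn) if tp + tn + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "specificity": specificity,
            "f1": f1, "accuracy": accuracy}
```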

6.3. Bayes R-CNN Result Analysis

The proposed object detection model was trained using different backbones such as ResNet50, ResNext50, Wide ResNet50, ShuffleNet V2, MobileNet V3-L, ResNet101, RegNet, and MRENet. We then calculated the mAP to evaluate the performance of the proposed model. The same hyperparameters were used for all backbones throughout the training process. The learning rate was set to 0.01, and the gamma (γ) of the learning rate scheduler was set to 0.2. Table 2 shows the results of the proposed model for all backbones. It can be seen from Table 2 that the proposed model with ShuffleNet V2 2× performed worse than the other backbones. On the contrary, the proposed model with MRENet outperforms the other models across all evaluation metrics. Therefore, the MRENet backbone was used for hyperparameter optimization to acquire the best result.
Table 3 compares the mAP values obtained with our proposed Bayes R-CNN architecture when employing different backbone networks. The Bayes R-CNN obtains an mAP of 86.87 using the ResNet50 backbone. When the ResNext50 backbone is used, the mAP increases marginally to 87.79. When the Wide ResNet50 backbone is used, the accuracy rises to 88.26. When the ShuffleNet V2 and MobileNet V3-L backbones are employed, there is a substantial drop in performance, with mAP ratings of 80.94 and 83.65, respectively. The adoption of the ResNet101 and RegNet backbones results in significant performance increases, with mAP values of 87.51 and 89.08, respectively. Finally, the Bayes R-CNN framework outperforms all other backbones with the greatest mAP of 91.23 when using the MRENet backbone.
The learning rate plays an important role in achieving the best result in the object detection system. Different gamma values for the learning rate scheduler were used with the same learning rate to train the proposed model. This study used gamma values of 0.3, 0.2, and 0.1 with a learning rate of 0.01. Table 4 shows the effect of this hyperparameter on the proposed model’s performance. It can be seen from Table 4 that the proposed model achieved its best results with a gamma of 0.1 on the DIOR dataset (mAP of 74.63) and a gamma of 0.2 on the HRSC2016 dataset (mAP of 91.23).
Table 5 presents an ablation study for the proposed object detection model, evaluating various configurations of MRENet, BDLAM, and MLFFM submodules. This study assesses the performance of each combination in terms of mAP across the DIOR and HRSC2016 datasets.
MRENet achieved an mAP of 72.67% when evaluated separately on the DIOR dataset, MLFFM had an mAP of 71.21%, and BDLAM had an mAP of 72.08%. Meanwhile, MRENet, MLFFM, and BDLAM obtained mAPs of 90.16%, 90.02%, and 90.37%, respectively, on the HRSC2016 dataset.
The mAP was increased to 73.26% on DIOR and 90.91% on HRSC2016 when MRENet and MLFFM were added together. On DIOR and HRSC2016, combining MRENet and BDLAM yielded mAPs of 73.58% and 91.04%, respectively, whereas combining MLFFM and BDLAM produced mAPs of 73.05% and 90.78%.
The best results were obtained when all three modules (MRENet, MLFFM, and BDLAM) were used simultaneously, resulting in mAPs of 74.63% on DIOR and 91.23% on HRSC2016. These findings imply that combining these modules within a Bayes R-CNN improves the overall performance of the Bayes R-CNN.
Figure 11 shows a visual comparison of the feature map visualizations derived from three different methods: the traditional FPN, MLFFM, and the combined MLFFM + BDLAM approach. In the feature map produced using the traditional FPN, there are noticeable inaccuracies in feature extraction. This is manifested through both false negatives and missing region-of-interest features, leading to an imprecise representation of the essential features within the image. Transitioning to the MLFFM method, there is a visible improvement in the extraction process. The feature map under this method is more accurate, capturing the relevant features with greater precision. However, it is crucial to note that while missing region-of-interest features are mitigated, there are minor occurrences of false negatives, indicating slight omissions in the feature extraction process. In contrast, the feature map generated with the integration of MLFFM and BDLAM presents a significant enhancement in accuracy. The MLFFM + BDLAM method not only accurately extracts features but also meticulously emphasizes the most crucial and relevant ones. The BDLAM component plays a pivotal role in this refinement, directing the model’s attention effectively to focus on and highlight the features of paramount importance.
Figure 12 shows the visualization of the proposed Bayes R-CNN model’s object detection performance on the DIOR dataset under different background scenarios. The presented results clearly demonstrate the model’s resilience to different and complicated backgrounds that are frequently encountered in real-world circumstances. The Bayes R-CNN has clearly demonstrated improved object detection and localization capabilities, accurately separating regions of interest against complex backgrounds. The model’s ability to retain high accuracy in a variety of environments demonstrates its versatility and durability, both of which are critical for object detection in a state-of-the-art dataset like DIOR. As a result, this image emphasizes the Bayes R-CNN model’s ability to handle complex object detection scenarios.
Figure 13 shows a visual illustration of the capabilities of the Bayes R-CNN model applied to the HRSC2016 dataset under various background scenarios. These findings highlight the model’s ability to handle a wide range of backgrounds, which is crucial for real-world object detection tasks. Objects are efficiently detected and located using the Bayes R-CNN model, even in potentially complicated backgrounds. The model’s consistent performance across various background scenarios demonstrates its adaptability and resilience, both of which are required for efficient object detection on the HRSC2016 dataset.
Figure 14 presents a visual exploration of the performance metrics across various object detection models, comparing their parameter sizes with their mAP percentages. The scatter plot reveals a diverse spectrum of parameter sizes, from a compact 31 M in models like LSKNet-S to a larger 55.1 M in models such as RoI-T. Notably, many models consistently remain around the 90% mAP mark, suggesting that performance remains relatively stable across different model complexities. For instance, the Bayes R-CNN, with its 54.7 M parameters, stands out by achieving a 91.2% mAP. This is especially remarkable when compared to models like RoI-T, which, despite its slightly larger parameter size, only manages an mAP of 86.2%. These data underscore the Bayes R-CNN’s optimal balance between efficiency and performance, making it a compelling choice for object detection tasks.

6.4. Image Super-Resolution on Remote Sensing Image Result Analysis

The image super-resolution technique can play a significant role in improving the quality of low-resolution remote sensing images. Therefore, this study proposed a Bayesian LESRCNN to improve the quality of the DIOR dataset. The model was trained for 3000 epochs with a batch size of 64 and a learning rate of 0.01. The proposed model achieved a mean PSNR of 32.56 dB and a mean SSIM of 0.9236 on the Urban100 dataset. The proposed model surpassed the accuracy of the baseline model for 2× scaling, as shown in Table 6.
The Bayes R-CNN was further evaluated to compare the effect of integrating the image super-resolution technique prior to the prediction step. Figure 15 shows the detection results extracted from three intricate scenarios, with outcomes depicted both prior and subsequent to the implementation of the image super-resolution technique.
The inference from Figure 15 indicates a noticeable impact of the super-resolution technique on our model’s performance. An assessment of the results without the super-resolution technique shows a slight decrease in prediction accuracy. This decrement can be attributed to the lower resolution images, which potentially lack the essential granularity needed for the model to effectively distinguish and detect objects.
Conversely, following the introduction of the image super-resolution technique, there is a significant increase in prediction accuracy. The super-resolution technique enhances the resolution of input images, thereby providing more detailed information. The richer detail within these images improves the model’s ability to distinguish between different objects, consequently reducing false positives and negatives.
Specifically, the first image from Figure 15b shows an increased number of correctly detected objects after applying the super-resolution technique, indicating a heightened recall rate. On the other hand, the second image exhibits a reduction in the false-positive rate following the implementation of the image super-resolution technique. This reduction demonstrates a marked improvement in precision. The third scenario illustrates the impact of superior image quality on object localization. The enhanced resolution of the images allows for more precise positioning of bounding boxes around detected objects, thereby improving the precision of detection.
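The inference pipeline behind Figure 15 can be summarized as a two-step procedure: upscale first, then detect. The sketch below assumes a torchvision-style detector interface, and `sr_model` and `detector` are placeholders standing in for the trained Bayesian LESRCNN and Bayes R-CNN rather than a published API.

```python
# Illustrative two-step inference: super-resolve the low-resolution scene,
# then run the detector on the enhanced image (both models in eval mode).
import torch

@torch.no_grad()
def detect_with_super_resolution(lr_image: torch.Tensor, sr_model, detector,
                                 score_threshold: float = 0.5):
    """lr_image: float tensor of shape [3, H, W] with values in [0, 1]."""
    sr_image = sr_model(lr_image.unsqueeze(0)).squeeze(0).clamp(0, 1)  # 2x upscaled image
    detections = detector([sr_image])[0]   # dict with "boxes", "labels", "scores"
    keep = detections["scores"] > score_threshold
    return {name: values[keep] for name, values in detections.items()}
```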

6.5. Uncertainty-Aware Object Detection Result Analysis

Uncertainty estimation plays a vital role in evaluating a model’s predictions under open-set conditions. For computer vision problems, uncertainty can be divided into epistemic and aleatoric uncertainty. Epistemic uncertainty reflects the model’s uncertainty about the observed data, which is usually caused by inadequate training data; higher epistemic uncertainty therefore signals less reliable predictions for a given image or in open-set conditions. Aleatoric uncertainty, on the other hand, captures uncertainty inherent in the image itself, such as noise and other variations, and can thus be used to assess the quality of any dataset or image. Figure 16 shows the uncertainty estimates for several predicted images. Epistemic uncertainty is lower for images in which the model accurately detects the objects and increases for images in which the prediction scores are very low. Aleatoric uncertainty, in turn, increases for images whose quality is degraded by inherent noise. The proposed method can therefore be used in real-world open-set conditions, where the model encounters data that were not seen during training and the reliability of its predictions can be judged through uncertainty measurement.
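A common way to obtain these two uncertainty types from a Bayesian network is Monte Carlo sampling of the weight posterior followed by an entropy decomposition of the averaged class probabilities. The sketch below reflects our reading of this procedure and is not the released implementation; the `model` is assumed to draw a fresh weight sample on every stochastic forward pass and to return class logits for a preprocessed input.

```python
# Hedged sketch: epistemic/aleatoric uncertainty from Monte Carlo forward
# passes through a Bayesian classification head.
import torch

@torch.no_grad()
def mc_uncertainty(model, image: torch.Tensor, n_samples: int = 20, eps: float = 1e-8):
    # Each forward pass uses a new weight sample from the variational posterior.
    probs = torch.stack([torch.softmax(model(image), dim=-1) for _ in range(n_samples)])
    mean_p = probs.mean(dim=0)
    total = -(mean_p * (mean_p + eps).log()).sum(dim=-1)                # predictive entropy
    aleatoric = -(probs * (probs + eps).log()).sum(dim=-1).mean(dim=0)  # expected entropy
    epistemic = total - aleatoric                                       # mutual information
    return epistemic, aleatoric
```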

6.6. Background Classification

The DIOR and HRSC2016 datasets encompass a variety of remote sensing images, each providing diverse background information and thereby serving as a rich source for data extraction and analysis. These images were classified into four distinct categories based on their background environments: Airport, City, Sea, and Suburb. A total of 2713 images were randomly selected from the datasets and assigned to these four categories.
In our study, we applied the AutoAugment technique with a policy pre-trained on ImageNet to enhance the training dataset; examples of the processed images are shown in Figure 17. This technique was applied to introduce diversity and improve the model’s robustness. The impact of this data augmentation is quantified in Table 7, which compares the overall classification accuracy of our model with and without it: the model achieved 99.12% with AutoAugment and 98.45% without any data augmentation. This demonstrates the effectiveness of AutoAugment in providing a more varied and comprehensive training dataset.
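In torchvision, this augmentation step can be expressed as in the sketch below; the resize dimensions and normalization constants are illustrative assumptions, not values reported in this paper.

```python
# Sketch of a training transform using AutoAugment with the ImageNet policy.
from torchvision import transforms
from torchvision.transforms import AutoAugment, AutoAugmentPolicy

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),                     # assumed input size
    AutoAugment(policy=AutoAugmentPolicy.IMAGENET),    # pre-trained ImageNet policy
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # standard ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
```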
Table 8 shows the validation results of MRENet for the different classes. The model was trained with the SGD optimizer and a learning rate of 0.01, and the trained model was then evaluated on the test set using standard classification metrics. MRENet demonstrated high accuracy across all categories, validating its efficacy.
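The per-class precision, recall, specificity, F1-score, and accuracy reported in Table 8 can be derived from a confusion matrix as sketched below; this is an illustrative computation, not the authors’ evaluation script.

```python
# Hedged sketch: per-class classification metrics from a confusion matrix.
from sklearn.metrics import confusion_matrix

def per_class_metrics(y_true, y_pred, labels):
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    results = {}
    for i, cls in enumerate(labels):
        tp = cm[i, i]
        fp = cm[:, i].sum() - tp
        fn = cm[i, :].sum() - tp
        tn = cm.sum() - tp - fp - fn
        results[cls] = {
            "precision":   tp / (tp + fp + 1e-12),
            "recall":      tp / (tp + fn + 1e-12),
            "specificity": tn / (tn + fp + 1e-12),
            "f1":          2 * tp / (2 * tp + fp + fn + 1e-12),
            "accuracy":    (tp + tn) / cm.sum(),
        }
    return results
```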
Our background classification model was subsequently integrated with the prediction layer of the Bayes R-CNN to generate object detections accompanied by background information. Figure 18 illustrates the outcome of retrieving background information in conjunction with object detection. We selected four images representative of two distinct background classes to evaluate the predictive capability of the proposed method. As depicted, the method predicts the respective background classes with high accuracy. This approach holds substantial potential for future research, as it facilitates the extraction of richer information from remote sensing imagery.
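Conceptually, the integration amounts to running the detector and the background classifier on the same scene and attaching the scene-level label to the detection output. The sketch below illustrates this under a torchvision-style detector interface; the function names are placeholders rather than the authors’ API.

```python
# Illustrative fusion of detections with a scene-level background label.
import torch

BACKGROUND_CLASSES = ["Airport", "City", "Sea", "Suburb"]

@torch.no_grad()
def detect_with_background(image: torch.Tensor, detector, background_classifier):
    """image: float tensor [3, H, W]; both models are assumed to be in eval mode."""
    detections = detector([image])[0]                    # dict of boxes/labels/scores
    logits = background_classifier(image.unsqueeze(0))   # [1, 4] class scores
    background = BACKGROUND_CLASSES[int(logits.argmax())]
    return {"background": background, **detections}
```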

7. Discussion

This study utilized two widely recognized state-of-the-art datasets, DIOR and HRSC2016, which are extensively used in the remote sensing community to benchmark object detection models. The DIOR dataset comprises high-resolution images spanning 20 object categories, providing a broad spectrum of scenarios and environmental conditions; this diversity is crucial for evaluating the robustness and generalizability of object detection models. The HRSC2016 dataset, on the other hand, focuses on high-resolution images of ships, offering detailed annotations and challenging scenarios such as varying scales, orientations, and complex backgrounds. To establish the sufficiency of these datasets for training and testing our model, we compared the performance of our model with other state-of-the-art models.
Table 9 compares the mAP attained by the proposed Bayes R-CNN with that of existing state-of-the-art models on the DIOR dataset. In this comparison, the Bayes R-CNN outperforms all others, obtaining the highest mAP of 74.6. Other models, including MSF-SNET, DFPN-YOLO, ASSD, and AFPN + GAS, obtained mAP scores of 66.5, 69.3, 71.1, and 73.3, respectively. The Bayes R-CNN also surpasses approaches such as A-MLFFM, ViT-G12X4, and FPN + MSDAM + MLFAM, which produced relatively high mAP scores in the range of 73.6 to 73.9. This result demonstrates the efficacy of the Bayes R-CNN for object detection on the DIOR dataset.
Table 10 provides a comparative analysis of the mAP of the Bayes R-CNN against other state-of-the-art models on the HRSC2016 dataset. The Bayes R-CNN obtained the highest mAP of 91.23. Compared to models with lower mAP scores ranging from 78.51 to 88.20, such as TOSO, Gliding Vertex, RSDet, and DAFNe, the Bayes R-CNN shows a considerable improvement, and it also edges out high-performing models such as CSL, R3Det, LSKNet-S, and OFCOS, whose mAP scores range from 89.62 to 91.07. These findings highlight the efficacy of the Bayes R-CNN for object detection on the HRSC2016 dataset.
During the evaluation of our background classification model, we observed several instances of misclassification that provide insight into the challenges faced by the model. Specifically, we identified one airport image that was incorrectly classified as a suburb, two suburb images that were mislabeled as a city, and one suburb image that was mistakenly categorized as an airport, as shown in Figure 19. These errors can be attributed to several factors related to the visual similarities between these classes and the model’s feature extraction process.
The airport image misclassified as a suburb resulted from the presence of vast open spaces near the airport, which confused the model. Similarly, the two suburb images were labeled as a city due to the higher density of buildings and urban-like features within these suburbs, which led the model to categorize them as city environments. Furthermore, the suburb image misclassified as an airport likely resulted from the presence of structures that the model associated with airport facilities. This indicates that future research could refine feature extraction techniques to better differentiate such visually similar environments.
Although our proposed method achieved better performance, one notable limitation of our model is its computational complexity. The integration of Bayesian techniques such as BBB and BDLAM increases both the computational load and the training time compared to traditional deep learning models. This can be a significant challenge for real-time applications and necessitates access to high-performance computing resources. Additionally, while our model effectively quantifies uncertainty and mitigates overfitting, it still faces challenges in distinguishing objects with highly similar visual features, leading to occasional misclassifications. Furthermore, our model’s dependency on high-resolution images to achieve optimal performance presents another limitation. In practical applications, such high-quality data may not always be available due to constraints in sensor capabilities or transmission bandwidth. Although we have incorporated a Bayesian image super-resolution technique to address this issue, ensuring consistent and accurate object detection across various datasets remains a challenge.
Future research could focus on enhancing the computational efficiency of our model, possibly through techniques such as model pruning and more efficient variational inference methods. Additionally, implementing advanced data augmentation strategies and exploring semi-supervised learning approaches could improve the model’s generalization capabilities and reduce its dependency on high-resolution images. We also see significant potential in extending our model to process video sequences in remote sensing, leveraging temporal information to improve detection accuracy and robustness.

8. Conclusions

This paper introduced Bayes R-CNN, a Bayesian object detection model designed to enhance the accuracy of object detection in complex datasets. It incorporates the MRENet model as the feature extractor and the MLFFM module in the FPN’s lateral connections to minimize information loss. A novel Bayesian distributed lightweight attention module enhances feature extraction within the FPN, and the traditional ROI pooling is upgraded to multiscale ROI align for more precise bounding box proposals. To address dataset imbalance and estimate uncertainty, the loss function combines focal and ELBO losses. Additionally, a Bayesian image super-resolution technique is employed to enhance low-resolution images, while both epistemic and aleatoric uncertainties are calculated to mitigate errors under open-set conditions. Experimental results show that Bayes R-CNN surpasses existing models, achieving mAP scores of 74.6% on the DIOR dataset and 91.23% on the HRSC2016 dataset.
Despite its high accuracy, Bayes R-CNN faces challenges with accurately proposing oriented bounding boxes and exhibits greater computational complexity compared to single-stage detectors. In the future, we would like to work on improving the oriented bounding box proposal and reducing the computational cost of our solution.

Author Contributions

Conceptualization, S.A.S.M.S. and J.T.; Data curation, H.S.K.; Formal analysis, Y.C., J.H.C., H.S.K., K.D.K. and S.A.; Investigation, J.H.C., H.S.K. and S.A.; Methodology, S.A.S.M.S. and J.T.; Software, S.A.S.M.S.; Supervision, S.A.; Validation, J.T., Y.C. and K.D.K.; Writing—original draft, S.A.S.M.S. and J.T.; Writing—review and editing, K.D.K. and S.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data that have been used are confidential.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Han, Z.; Wang, C.; Fu, Q. Arbitrary-Oriented Target Detection in Large Scene SAR Images. Defence Technol. 2020, 16, 933–946. [Google Scholar] [CrossRef]
  2. Yu, D.; Ji, S. A New Spatial-Oriented Object Detection Framework for Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4407416. [Google Scholar] [CrossRef]
  3. Tian, Z.; Huang, J.; Yang, Y.; Nie, W. KCFS-Yolov5: A High-Precision Detection Method for Object Detection in Aerial Remote Sensing Images. Appl. Sci. 2023, 13, 649. [Google Scholar] [CrossRef]
  4. Roy, K.; Sinha Chaudhuri, S.; Pramanik, S.; Banerjee, S. Deep Neural Network Based Detection and Segmentation of Ships for Maritime Surveillance. Comput. Syst. Sci. Eng. 2023, 44, 647–662. [Google Scholar] [CrossRef]
  5. Pi, Y.; Nath, N.D.; Behzadan, A.H. Convolutional Neural Networks for Object Detection in Aerial Imagery for Disaster Response and Recovery. Adv. Eng. Inform. 2020, 43, 101009. [Google Scholar] [CrossRef]
  6. Reddy, T.; RM, S.P.; Parimala, M.; Chowdhary, C.L.; Hakak, S.; Khan, W.Z. A Deep Neural Networks Based Model for Uninterrupted Marine Environment Monitoring. Comput. Commun. 2020, 157, 64–75. [Google Scholar]
  7. Girshick, R. Fast R-CNN. In Proceedings of the 2015 International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
  8. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-Based Fully Convolutional Networks. Adv. Neural Inf. Process Syst. 2016, 29, 379–387. [Google Scholar]
  9. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot Multibox Detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  10. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  11. Ding, J.; Xue, N.; Long, Y.; Xia, G.-S.; Lu, Q. Learning ROI Transformer for Oriented Object Detection in Aerial Images. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  12. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for Object Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021. [Google Scholar]
  13. Zhang, G.; Lu, S.; Zhang, W. CAD-NET: A Context-Aware Detection Network for Objects in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2019, 57, 10015–10024. [Google Scholar] [CrossRef]
  14. Chen, S.; Zhao, J.; Zhou, Y.; Wang, H.; Yao, R.; Zhang, L.; Xue, Y. Info-FPN: An Informative Feature Pyramid Network for Object Detection in Remote Sensing Images. Expert Syst. Appl. 2023, 214, 119132. [Google Scholar] [CrossRef]
  15. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object Detection in Optical Remote Sensing Images: A Survey and a New Benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  16. Huyan, L.; Bai, Y.; Li, Y.; Jiang, D.; Zhang, Y.; Zhou, Q.; Cui, T. A Lightweight Object Detection Framework for Remote Sensing Images. Remote Sens. 2021, 13, 683. [Google Scholar] [CrossRef]
  17. Huang, W.; Li, G.; Chen, Q.; Ju, M.; Qu, J. CF2PN: A Cross-Scale Feature Fusion Pyramid Network Based Remote Sensing Target Detection. Remote Sens. 2021, 13, 847. [Google Scholar] [CrossRef]
  18. Sun, Y.; Liu, W.; Gao, Y.; Hou, X.; Bi, F. A Dense Feature Pyramid Network for Remote Sensing Object Detection. Appl. Sci. 2022, 12, 4997. [Google Scholar] [CrossRef]
  19. Liu, Y.; Li, Q.; Yuan, Y.; Wang, Q. Single-Shot Balanced Detector for Geospatial Object Detection. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022. [Google Scholar]
  20. Yuan, Z.; Liu, Z.; Zhu, C.; Qi, J.; Zhao, D. Object Detection in Remote Sensing Images via Multi-Feature Pyramid Network with Receptive Field Block. Remote Sens. 2021, 13, 862. [Google Scholar] [CrossRef]
  21. Xu, T.; Sun, X.; Diao, W.; Zhao, L.; Fu, K.; Wang, H. ASSD: Feature Aligned Single-Shot Detection for Multiscale Objects in Aerial Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5607117. [Google Scholar] [CrossRef]
  22. Wang, J.; Wang, Y.; Wu, Y.; Zhang, K.; Wang, Q. FRPNet: A Feature-Reflowing Pyramid Network for Object Detection of Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8004405. [Google Scholar] [CrossRef]
  23. Wang, G.; Zhuang, Y.; Chen, H.; Liu, X.; Zhang, T.; Li, L.; Sang, Q. FSOD-net: Full-Scale Object Detection from Optical Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5602918. [Google Scholar] [CrossRef]
  24. Cheng, G.; He, M.; Hong, H.; Yao, X.; Qian, X.; Guo, L. Guiding Clean Features for Object Detection in Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8019205. [Google Scholar] [CrossRef]
  25. Dong, X.; Qin, Y.; Gao, Y.; Fu, R.; Liu, S.; Ye, Y. Attention-Based Multi-Level Feature Fusion for Object Detection in Remote Sensing Images. Remote Sens. 2022, 14, 3735. [Google Scholar] [CrossRef]
  26. Xiao, Z.; Qian, L.; Shao, W.; Tan, X.; Wang, K. Axis Learning for Orientated Objects Detection in Aerial Images. Remote Sens. 2020, 12, 908. [Google Scholar] [CrossRef]
  27. Chen, Z.; Chen, K.; Lin, W.; See, J.; Yu, H.; Ke, Y.; Yang, C. Piou Loss: Towards Accurate Oriented Object Detection in Complex Environments. In Proceedings of the Computer Vision—ECCV 2020 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 195–211. [Google Scholar]
  28. Xiao, J.; Yao, Y.; Zhou, J.; Guo, H.; Yu, Q.; Wang, Y.-F. FDLR-Net: A Feature Decoupling and Localization Refinement Network for Object Detection in Remote Sensing Images. Expert Syst. Appl. 2023, 225, 120068. [Google Scholar] [CrossRef]
  29. Lang, S.; Ventola, F.; Kersting, K. DAFNe: A One-Stage Anchor-Free Approach for Oriented Object Detection. arXiv 2021, arXiv:2109.06148. [Google Scholar]
  30. Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Chen, K. Rtmdet: An Empirical Study of Designing Real-Time Object Detectors. arXiv 2022, arXiv:2212.07784. [Google Scholar]
  31. Wang, D.; Zhang, J.; Du, B.; Xia, G.-S.; Tao, D. An Empirical Study of Remote Sensing Pretraining. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5608020. [Google Scholar] [CrossRef]
  32. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.M.; Yang, J.; Li, X. Large Selective Kernel Network for Remote Sensing Object Detection. arXiv 2023, arXiv:2303.09030. [Google Scholar]
  33. Li, Y.; Huang, Q.; Pei, X.; Jiao, L.; Shang, R. Radet: Refine Feature Pyramid Network and Multi-Layer Attention Network for Arbitrary-Oriented Object Detection of Remote Sensing Images. Remote Sens. 2020, 12, 389. [Google Scholar] [CrossRef]
  34. Zhang, W.; Jiao, L.; Liu, X.; Liu, J. Multi-Scale Feature Fusion Network for Object Detection in VHR Optical Remote Sensing Images. In Proceedings of the 2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019. [Google Scholar]
  35. Ghiasi, G.; Lin, T.-Y.; Le, Q.V. Nas-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  36. Guo, C.; Fan, B.; Zhang, Q.; Xiang, S.; Pan, C. AugFPN: Improving Multi-Scale Feature Learning for Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  37. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  38. Vo, X.-T.; Jo, K.-H. Enhanced Feature Pyramid Networks by Feature Aggregation Module and Refinement Module. In Proceedings of the 2020 13th International Conference on Human System Interaction (HSI), Tokyo, Japan, 6–8 June 2020. [Google Scholar]
  39. Liu, E.; Zheng, Y.; Pan, B.; Xu, X.; Shi, Z. DCL-net: Augmenting the Capability of Classification and Localization for Remote Sensing Object Detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 7933–7944. [Google Scholar] [CrossRef]
  40. Shi, L.; Kuang, L.; Xu, X.; Pan, B.; Shi, Z. Canet: Centerness-Aware Network for Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5603613. [Google Scholar] [CrossRef]
  41. Luo, Y.; Cao, X.; Zhang, J.; Guo, J.; Shen, H.; Wang, T.; Feng, Q. CE-FPN: Enhancing Channel Information for Object Detection. Multimed. Tools Appl. 2022, 81, 30685–30704. [Google Scholar] [CrossRef]
  42. Liu, N.; Celik, T.; Li, H.-C. Gated Ladder-Shaped Feature Pyramid Network for Object Detection in Optical Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6001505. [Google Scholar] [CrossRef]
  43. Shridhar, K.; Laumann, F.; Liwicki, M. A Comprehensive Guide to Bayesian Convolutional Neural Network with Variational Inference. arXiv 2019, arXiv:1901.02731. [Google Scholar]
  44. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar] [CrossRef]
  45. Qin, Z.; Zhang, P.; Wu, F.; Li, X. FCANET: Frequency Channel Attention Networks. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021. [Google Scholar] [CrossRef]
  46. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar] [CrossRef]
  47. Tian, C.; Zhuge, R.; Wu, Z.; Xu, Y.; Zuo, W.; Chen, C.; Lin, C.-W. Lightweight Image Super-Resolution with Enhanced CNN. Knowl.-Based Syst. 2020, 205, 106235. [Google Scholar] [CrossRef]
  48. Zhang, T.; Zhuang, Y.; Wang, G.; Dong, S.; Chen, H.; Li, L. Multiscale Semantic Fusion-Guided Fractal Convolutional Object Detection Network for Optical Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5608720. [Google Scholar] [CrossRef]
  49. Yang, Y.; Sun, X.; Diao, W.; Li, H.; Wu, Y.; Li, X.; Fu, K. Adaptive Knowledge Distillation for Lightweight Remote Sensing Object Detectors Optimizing. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5623715. [Google Scholar] [CrossRef]
  50. Cha, K.; Seo, J.; Lee, T. A Billion-Scale Foundation Model for Remote Sensing Images. arXiv 2023, arXiv:2304.05215. [Google Scholar] [CrossRef]
  51. Dong, X.; Qin, Y.; Fu, R.; Gao, Y.; Liu, S.; Ye, Y.; Li, B. Multiscale Deformable Attention and Multilevel Features Aggregation for Remote Sensing Object Detection. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6510405. [Google Scholar] [CrossRef]
  52. Yao, X.; Shen, H.; Feng, X.; Cheng, G.; Han, J. R2IPoints: Pursuing Rotation-Insensitive Point Representation for Aerial Object Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5623512. [Google Scholar] [CrossRef]
  53. Yang, X.; Yan, J.; Liao, W.; Yang, X.; Tang, J.; He, T. Scrdet++: Detecting Small, Cluttered and Rotated Objects via Instance-Level Feature Denoising and Rotation Loss Smoothing. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 2384–2399. [Google Scholar] [CrossRef]
  54. Yang, X.; Yan, J.; Feng, Z.; He, T. R3det: Refined Single-Stage Detector with Feature Refinement for Rotating Object. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 3163–3171. [Google Scholar]
  55. Qian, W.; Yang, X.; Peng, S.; Yan, J.; Guo, Y. Learning Modulated Loss for Rotated Object Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 2458–2466. [Google Scholar]
  56. Yang, X.; Yan, J. Arbitrary-Oriented Object Detection with Circular Smooth Label. In Proceedings of the Computer Vision—ECCV 2020 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 677–694. [Google Scholar]
  57. Xu, Y.; Fu, M.; Wang, Q.; Wang, Y.; Chen, K.; Xia, G.-S.; Bai, X. Gliding Vertex on the Horizontal Bounding Box for Multi-Oriented Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1452–1459. [Google Scholar] [CrossRef]
  58. Feng, P.; Lin, Y.; Guan, J.; He, G.; Shi, H.; Chambers, J. Toso: Student’s-T Distribution Aided One-Stage Orientation Target Detection in Remote Sensing Images. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, 4–8 May 2020. [Google Scholar]
  59. Ming, Q.; Zhou, Z.; Miao, L.; Zhang, H.; Li, L. Dynamic Anchor Learning for Arbitrary-Oriented Object Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 2355–2363. [Google Scholar]
  60. Lin, Y.; Feng, P.; Guan, J. IENet: Interacting Embranchment One Stage Anchor Free Detector for Orientation Aerial Object Detection. arXiv 2019, arXiv:1912.00969. [Google Scholar]
  61. Zhang, D.; Wang, C.; Fu, Q. OFCOS: An Oriented Anchor-Free Detector for Ship Detection in Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6004005. [Google Scholar] [CrossRef]
Figure 1. The architecture of the Uncertainty-Aware Bayesian Object Detection Model, encompassing dataset collection, classification augmentation, Bayes R-CNN, MRENet, Bayesian image super-resolution technique, and output showing object detection with background analysis and uncertainty estimation.
Figure 2. Architectural overview of MRENet, which is divided into stem, body, and head sections. (a) Complete MRENet network, (b) detailed view of the body, and (c) close-up of an individual block.
Figure 3. Comparative structure of the base stem module and the DHFL stem module. The DHFL stem module is hierarchically designed to process input data across multiple scales, resulting in a unified output feature map.
Figure 4. Grad-CAM visualizations illustrating the effectiveness of the DHFL stem module relative to a traditional approach without DHFL integration.
Figure 5. Structure of the individual block, comprising a convolution layer, CSAM layer, and MLP layer, designed to efficiently process the input feature map.
Figure 6. Architecture of the CSAM, integrated into each individual block to conduct attention operations across both channel and spatial dimensions.
Figure 7. Overview of the Bayesian FPN featuring the proposed MLFFM and BDLAM modules.
Figure 8. Comparative overview of different attention modules alongside our proposed BDLAM. (a) The squeeze-and-excitation module, a channel attention module; (b) FcaNet, an improved version of the squeeze-and-excitation module; (c) the coordinate attention module; and (d) our proposed BDLAM, which applies both local and global attention to improve feature extraction.
Figure 9. Architecture of the RPN within our Bayes R-CNN framework, employing Bayesian convolution layers to enhance the model’s robustness in identifying regions of interest.
Figure 10. The overall structure of the Bayesian LESRCNN to improve the quality of the low-resolution remote sensing image. The Bayesian LESRCNN consists of IEEB, RB, and IRB parts, where the CSAM attention module is introduced at the IEEB part.
Figure 11. Comparative feature map visualizations: the traditional FPN exhibits extraction inaccuracies, MLFFM refines with minor omissions, whereas MLFFM + BDLAM captures and emphasizes intricate details with precision.
Figure 12. Object detection results obtained using Bayes R-CNN on the DIOR dataset.
Figure 13. Object detection results achieved with Bayes R-CNN on the HRSC2016 dataset.
Figure 14. Comparison of parameter sizes and mAP values across various state-of-the-art (SOTA) models on the HRSC2016 dataset.
Figure 15. Object detection results before and after integrating the image super-resolution model, showing a noticeable improvement in detection accuracy.
Figure 16. The proposed model’s demonstration of uncertainty estimation across various remote sensing images, calculating both aleatoric and epistemic uncertainty to assess performance.
Figure 17. Original images (left) and their augmented versions (right) using a pretrained ImageNet auto augment policy to enhance model generalization.
Figure 18. Object detection and background classification for various remote sensing images are depicted. The system autonomously identifies objects and categorizes backgrounds within the remote sensing images.
Figure 19. Misclassification examples from our background classification model.
Table 1. List of abbreviations used in this study.
Abbreviation | Full Term
AI | Artificial Intelligence
BBB | Bayes by Backpropagation
BDLAM | Bayesian Distributed Lightweight Attention Module
CNN | Convolutional Neural Network
CSAM | Channel–Spatial Attention Module
DIOR | Dataset for Object Detection in Remote Sensing
FPN | Feature Pyramid Network
HRSC2016 | High-Resolution Ship Collection 2016
MRENet | Multi-Resolution Extraction Network
MLFFM | Multi-Level Feature Fusion Module
R-CNN | Region-based Convolutional Neural Network
RPN | Region Proposal Network
YOLO | You Only Look Once
Table 2. Performance evaluation of Bayes R-CNN on the DIOR dataset using various backbone architectures. The models are trained with different backbones, and the optimal model is chosen based on their performance assessments.
Framework | Backbone | mAP@0.5
Bayes R-CNN (ours) | ResNet50 | 68.59
Bayes R-CNN (ours) | ResNeXt50 | 67.85
Bayes R-CNN (ours) | Wide ResNet50 | 69.18
Bayes R-CNN (ours) | ShuffleNet V2 | 63.78
Bayes R-CNN (ours) | MobileNet V3-L | 66.18
Bayes R-CNN (ours) | ResNet101 | 70.19
Bayes R-CNN (ours) | RegNet | 71.77
Bayes R-CNN (ours) | MRENet | 73.91
Table 3. Performance assessment of the object detection model on the HRSC2016 dataset, indicating the mAP scores. The model is trained with different backbone architectures, and the optimal model is determined based on their performance evaluations.
Framework | Backbone | mAP
Bayes R-CNN (ours) | ResNet50 | 86.87
Bayes R-CNN (ours) | ResNeXt50 | 87.79
Bayes R-CNN (ours) | Wide ResNet50 | 88.26
Bayes R-CNN (ours) | ShuffleNet V2 | 80.94
Bayes R-CNN (ours) | MobileNet V3-L | 83.65
Bayes R-CNN (ours) | ResNet101 | 87.51
Bayes R-CNN (ours) | RegNet | 89.08
Bayes R-CNN (ours) | MRENet | 91.23
Table 4. Comparative analysis of hyperparameter effects on the performance of the proposed model. The parameter γ set to 0.2 yields optimal results based on our experimentation.
γ | DIOR | HRSC2016
0.3 | 73.50 | 90.83
0.2 | 73.91 | 91.23
0.1 | 74.63 | 91.15
Table 5. Ablation analysis of Bayes R-CNN conducted on the DIOR and HRSC2016 datasets. The ✓ mark indicates that the particular module was used during the experiment.
Module | MRENet | MLFFM | BDLAM | mAP (DIOR) | mAP (HRSC2016)
Bayes R-CNN | ✓ | - | - | 72.67 | 90.16
Bayes R-CNN | - | ✓ | - | 71.21 | 90.02
Bayes R-CNN | - | - | ✓ | 72.08 | 90.37
Bayes R-CNN | ✓ | ✓ | - | 73.26 | 90.91
Bayes R-CNN | ✓ | - | ✓ | 73.58 | 91.04
Bayes R-CNN | - | ✓ | ✓ | 73.05 | 90.78
Bayes R-CNN | ✓ | ✓ | ✓ | 74.63 | 91.23
Table 6. Mean PSNR and SSIM of the proposed Bayesian LESRCNN compared with the baseline model for 2× super-resolution.
Metric | Baseline | Ours | Improvement
Mean PSNR | 31.45 | 31.66 | 0.21
Mean SSIM | 0.9206 | 0.9236 | 0.003
Table 7. The overall accuracy of our background classification model with and without the data augmentation method.
Method | Accuracy (%)
Without AutoAugment | 98.45
With AutoAugment | 99.12
Table 8. Evaluation metrics for the proposed background classification model, including precision, recall, specificity, F1-score, and accuracy, providing a comprehensive multiclass assessment.
Class | Precision | Recall | Specificity | F1-Score | Accuracy
Airport | 99.5 | 99.5 | 98.5 | 99.2 | 99.3
City | 99.8 | 99.0 | 99.3 | 99.4 | 99.2
Sea | 99.2 | 98.8 | 99.0 | 99.0 | 98.9
Suburb | 99.0 | 99.5 | 99.2 | 99.3 | 99.1
Table 9. Comparison of mAP between the proposed model and SOTA models on the DIOR dataset. C1–C20 denote the 20 DIOR object categories.
Reference | Method | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 | C14 | C15 | C16 | C17 | C18 | C19 | C20 | mAP
[16] | MSF-SNET | 90.3 | 76.6 | 90.9 | 69.6 | 37.5 | 88.3 | 70.6 | 70.8 | 63.6 | 69.9 | 61.9 | 59.0 | 57.5 | 20.5 | 90.6 | 72.4 | 80.9 | 60.3 | 39.8 | 58.6 | 66.5
[17] | CF2PN | 78.3 | 78.3 | 76.5 | 88.4 | 37.0 | 70.9 | 59.9 | 71.2 | 51.1 | 75.5 | 77.1 | 56.7 | 58.6 | 76.0 | 70.6 | 55.5 | 88.8 | 50.8 | 36.9 | 86.3 | 67.2
[18] | DFPN-YOLO | 80.2 | 76.8 | 72.7 | 89.1 | 43.4 | 76.9 | 72.3 | 59.8 | 56.4 | 74.3 | 71.6 | 63.1 | 58.7 | 81.5 | 40.1 | 74.2 | 85.8 | 73.6 | 49.7 | 86.5 | 69.3
[20] | MFPNet | 76.6 | 83.4 | 80.6 | 82.1 | 44.3 | 75.6 | 68.5 | 85.9 | 63.9 | 77.3 | 77.2 | 62.1 | 58.8 | 77.2 | 76.8 | 60.3 | 86.4 | 64.5 | 41.5 | 80.2 | 71.2
[21] | ASSD | 85.6 | 82.4 | 75.8 | 89.5 | 40.7 | 77.6 | 64.7 | 67.1 | 61.7 | 80.8 | 78.6 | 62.0 | 58.0 | 84.9 | 76.7 | 65.3 | 87.9 | 62.4 | 44.5 | 76.3 | 71.1
[22] | FRPNet | 64.5 | 82.6 | 77.7 | 81.7 | 47.1 | 69.6 | 50.6 | 80.0 | 71.7 | 81.3 | 77.4 | 78.7 | 82.4 | 62.9 | 72.6 | 67.6 | 81.2 | 65.2 | 52.7 | 89.1 | 71.8
[24] | AFPN + GAS | 62.8 | 86.5 | 74.8 | 89.2 | 49.2 | 76.6 | 72.5 | 85.7 | 75.1 | 81.3 | 83.3 | 60.2 | 62.7 | 72.7 | 77.3 | 61.9 | 88.0 | 69.9 | 47.0 | 89.7 | 73.3
[25] | A-MLFFM | 70.9 | 83.1 | 71.9 | 86.5 | 49.3 | 78.2 | 70.3 | 83.7 | 76.7 | 76.0 | 80.2 | 55.9 | 62.7 | 89.0 | 71.3 | 79.1 | 81.4 | 60.1 | 55.6 | 89.4 | 73.6
[27] | FSoD-Net | 88.9 | 66.9 | 86.8 | 90.2 | 45.5 | 79.6 | 48.2 | 86.9 | 75.5 | 67.0 | 77.3 | 53.6 | 59.7 | 78.3 | 69.9 | 75.0 | 91.4 | 52.3 | 52.0 | 90.6 | 71.8
[48] | MSFC-Net | 85.8 | 76.2 | 74.4 | 90.1 | 44.1 | 78.1 | 55.5 | 60.9 | 59.5 | 76.9 | 73.7 | 49.5 | 57.2 | 89.6 | 69.2 | 76.5 | 86.7 | 51.8 | 55.2 | 84.3 | 70.0
[49] | ARSD | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 70.1
[50] | ViT-G12X4 | 81.4 | 61.7 | 81.1 | 89.8 | 54.1 | 80.8 | 40.3 | 79.4 | 89.0 | 79.3 | 84.5 | 55.8 | 65.6 | 89.5 | 86.1 | 71.5 | 90.1 | 66.2 | 51.8 | 73.6 | 73.6
[51] | FPN + MSDAM + MLFAM | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 73.9
[52] | R2IPoints | 67.9 | 85.9 | 76.2 | 85.4 | 51.5 | 78.0 | 73.1 | 85.1 | 72.5 | 81.0 | 77.3 | 57.3 | 62.4 | 71.8 | 69.7 | 67.7 | 81.4 | 68.4 | 52.2 | 89.0 | 72.7
[53] | SCRDet++ | 66.3 | 83.3 | 74.3 | 78.3 | 52.4 | 77.9 | 70.0 | 84.2 | 77.9 | 80.7 | 81.2 | 56.7 | 63.7 | 73.2 | 71.9 | 71.2 | 83.4 | 62.2 | 55.6 | 90.0 | 73.2
Ours | Bayes R-CNN | 93.6 | 73.5 | 93.5 | 87.4 | 47.2 | 89.9 | 59.0 | 68.1 | 59.2 | 83.2 | 83.9 | 57.6 | 62.2 | 68.4 | 92.4 | 81.5 | 90.7 | 55.8 | 71.5 | 73.4 | 74.6
Table 10. Comparison of mAP between the proposed model and SOTA models on the HRSC2016 dataset.
Reference | Method | mAP
[26] | Axis Learning | 78.51
[27] | CenterNet-OBB + PIoU | 89.20
[28] | DAFNe | 87.76
[30] | RTMDet-R tiny | 90.60
[31] | RSP-ViTAEv2-S-FPN-ORCN | 90.40
[32] | LSKNet-S | 90.65
[39] | DCL | 89.46
[54] | R3Det | 89.26
[55] | RSDet | 86.50
[56] | CSL | 89.62
[57] | Gliding Vertex | 88.20
[58] | TOSO | 79.29
[59] | DAL | 88.95
[60] | IENet | 75.01
[61] | OFCOS | 91.07
Ours | Bayes R-CNN | 91.23