Article

An Optimized Object Detection Algorithm for Marine Remote Sensing Images

School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China
*
Author to whom correspondence should be addressed.
Mathematics 2024, 12(17), 2722; https://doi.org/10.3390/math12172722
Submission received: 12 July 2024 / Revised: 17 August 2024 / Accepted: 28 August 2024 / Published: 31 August 2024

Abstract

To address the small dataset sizes, small targets, and complex scenes often encountered in offshore remote sensing imagery, this paper employs an interpolation-based method to achieve super-resolution-assisted target detection. This approach follows the same logic as popular GAN- and diffusion-based super-resolution networks but is considerably more lightweight. Additionally, the image count is expanded fivefold by supplementing the dataset with DOTA and applying data augmentation. Framework-wise, building on the Faster R-CNN model, the combination of a residual backbone network and a balanced pyramid structure allows the model to adapt to small-target scenarios. Moreover, an attention mechanism, a random anchor re-selection strategy, and the replacement of quantization operations with bilinear interpolation further enhance the model's detection capability at low cost. Ablation and comparative experiments show that, with a simple backbone, the proposed algorithm achieves a mAP of 71.2% on the dataset, an improvement of about 10 percentage points over the Faster R-CNN baseline.

1. Introduction

Remote sensing originated in the 1960s and can be understood as the acquisition of information about an object or area from a distance. With improvements in spatial, temporal, and spectral resolution, remote sensing images now play a major role in areas such as the economy, resource management, and ocean management. In ocean management in particular, remote sensing is irreplaceable because of the large areas involved and the difficulty of reaching them. In most coastal provinces of China, manual interpretation of remote sensing images to locate targets is still the main method used; it consumes considerable manpower and material resources and is extremely inefficient. These factors create great difficulties for marine management and resource protection.
Nearshore areas are a distinctive geographical setting and contain many targets worth studying, such as factories and ships. Capturing information in a timely manner and monitoring targets both onshore and offshore in real time is an important issue. Achieving the fast and efficient automated processing of large volumes of remote sensing data is therefore a key research direction, and target recognition and detection is one of its basic tasks [1]. Its essence is to locate and classify possible targets in an image, and it plays an indispensable role in managing the nearshore area.
The field of image classification and object detection has gained significant attention, especially after AlexNet [2] won first place in the 2012 ImageNet classification competition. This breakthrough led to an increased focus on improving image classification and object detection techniques. VGG [3] and ResNet [4] have demonstrated excellent feature extraction performance, contributing to both tasks. A notable advance was the R-CNN algorithm introduced by Girshick, R. B. [5] in 2014, which successfully applied deep learning to object detection and set the stage for further developments in the field. In 2015, Girshick, R. B. [6] proposed the Fast R-CNN algorithm, which improved computational speed by sharing convolutional features across proposals and training the network in a single stage. Building on this, Ren, S. [7] introduced the Faster R-CNN algorithm in 2015, which remains popular today; its Region Proposal Network replaces the cumbersome selective search and reduces the time needed to generate candidate boxes from roughly 2 s to about 10 ms per image, improving both detection speed and accuracy. To address the rounding errors of the RoI Pooling layer, He, K. [8] proposed the Mask R-CNN algorithm in 2017, enabling simultaneous object detection and instance segmentation. In 2018, Cai, Z. [9] introduced Cascade R-CNN, which progressively raises the detection threshold across cascaded stages during training to obtain high-quality detections. Ding, J. [10] proposed the RoI Transformer, which uses a learnable RoI Learner to transform horizontal RoIs into rotated RoIs for oriented object detection in aerial images. Alongside two-stage algorithms such as the R-CNN family, one-stage algorithms that prioritize detection speed have also been developed; notable examples include the YOLO series [11,12,13] and SSD [14]. Remote sensing images present unique challenges due to their high shooting angle, distinctive imaging methods, high resolution, small target areas, dense arrangements, and complex backgrounds, and they are often affected by environmental interference such as weather. Object detection in remote sensing therefore requires specialized improvements tailored to these characteristics to achieve accurate results.
Recent research [15,16,17] has shown that, despite the existence of many object detection methods, Faster R-CNN, a representative two-stage algorithm, remains one of the most widely used and accurate object detectors. This paper builds on and improves it, proposing an optimized object detection algorithm for ocean remote sensing images based on Faster R-CNN. The main contributions of this paper are as follows:
(1) A reconstruction algorithm for ocean remote sensing images has been proposed. We adopted image super-resolution reconstruction technology, applying bicubic interpolation to low-resolution remote sensing images and improving the Super-Resolution Convolutional Neural Network (SRCNN) reconstruction algorithm [18] to enhance their resolution. Combined with the large-scale Dataset for Object deTection in Aerial images (DOTA) [19], the data were augmented through flipping, cropping, and unified annotation processing to obtain a sufficient number of training samples.
(2) A feature extraction network for small targets has been proposed. We improved the residual network and constructed a balanced feature pyramid for the overall network. Through bottom-up, horizontally connected, and top-down structures, the feature information of small targets is fully learned. After each convolution operation, attention mechanisms are used to focus on the spatial and channel features of small targets.
(3) A candidate region selection method has been proposed. A supplementary anchor selection step is applied in the candidate region network, giving small targets ample opportunity to enter the training process. The quantization of feature maps and candidate boxes is removed and floating-point coordinate information is retained, improving the algorithm's accuracy and enhancing the model's detection performance.
This paper is organized as follows: Section 2 introduces the dataset construction and image super-resolution reconstruction algorithms; Section 3 introduces the feature extraction networks for small targets; Section 4 introduces the candidate region generation network; Section 5 introduces the performance experiments and evaluations; and Section 6 summarizes the entire paper.

2. Dataset Construction and Image Super-Resolution Reconstruction

Remote sensing images come in various types due to differences in the sensors and platforms used, and these include visible light, hyper-spectral, and high-resolution remote sensing images. These images are widely used in computer vision tasks such as detection, classification, and tracking. In this paper, the data used are sourced from the near-sea area of Liaoning and the DOTA, which were selected based on specific criteria relevant to the study’s objectives. The near-sea area of Liaoning was chosen for its unique environmental conditions, which include a mix of small-scale and small-target scenarios under complex and often challenging weather conditions. This dataset provides a practical representation of real-world remote sensing challenges in maritime environments, making it ideal for testing and validating the proposed detection algorithms. The DOTA was selected due to its comprehensive coverage of various object categories and its high-quality annotations, which are critical for benchmarking object detection models. DOTA scenes are densely distributed and their background is complex, which can improve the robustness of the model and coincide with the research focus of improving the detection accuracy of remote sensing images. The selection of these datasets ensured that the research was grounded in real scenarios and reflected the specific challenges faced by remote sensing applications, thus enhancing the relevance and applicability of the research results.
DOTA is a remote sensing image dataset released by Wuhan University in 2017 for object detection in aerial photography, aiming to detect and evaluate target objects in remote sensing images. It contains a total of 2806 aerial images covering 15 common target categories and has the following characteristics: (1) complex backgrounds; (2) diversity; and (3) many small targets.
The dataset contains 188,282 targets in total. Statistics were compiled for each category: ships and vehicles are the most numerous, accounting for about 70% of all targets. Our analysis shows that a single remote sensing image can contain hundreds or even thousands of small targets such as ships and vehicles, and this abundance of small targets poses significant challenges to the detection task. The receptive field issue is visualized in Figure 1, which displays feature maps from different layers during network training: the left image is the original, the middle is the output of the fifth layer, and the right is the output of the second layer. The comparison shows that, as the network learns from the bottom up, the details and textures of small target objects remain clear in the lower layers, making it possible to distinguish between different types of small targets; in the top layers, however, the targets become very blurred, their categories can no longer be identified by eye, and the gaps between small targets are difficult to discern.
During data augmentation, simple cropping reduces the resolution of some data, making targets unclear in the remote sensing images. Moreover, the dataset contains many small targets, which account for a large proportion of the annotated instances, and the reduced resolution further complicates their detection. Insufficient resolution can be characterized from two angles: (1) scene characteristics, where the ratio of the bounding-box size to the image dimensions is mostly below 0.1, which is generally considered a small object detection setting; and (2) technical characteristics, where super-resolution has been shown to improve detection performance by enhancing data quality in small object detection. In response to the problem of low-resolution remote sensing images, it is therefore necessary to improve image resolution without compromising clarity and to generate high-quality datasets for the subsequent object detection network. We investigated bicubic interpolation and neural network methods and ultimately designed a deep-learning-based super-resolution reconstruction method for this dataset.

2.1. Bicubic Interpolation

The bicubic interpolation algorithm can also be called cubic convolutional interpolation. The main idea is to calculate the interpolation kernel through specific formulas, and the form of the kernel function is a piecewise cubic polynomial.
Keys, R. G. [20] decomposed the two-dimensional interpolation problem into two one-dimensional interpolation operations, simplifying the spatial interpolation. Figure 2 shows the principle of the bicubic interpolation algorithm: the blue dots represent known data points, and the red dot (x, y) represents the pixel to be interpolated. Bicubic interpolation uses the grayscale values of the 4 × 4 pixels surrounding the red dot, computes the influence factor of each of the 16 known points on the pixel to be interpolated through the interpolation basis function, and combines them to obtain the pixel value at (x, y).
Formula (1) shows the interpolation Basis Function (BF) for fitting data using bicubic interpolation:
$$BF(x) = \begin{cases} (a+2)\,|x|^{3} - (a+3)\,|x|^{2} + 1 & \text{for } |x| \le 1 \\ a\,|x|^{3} - 5a\,|x|^{2} + 8a\,|x| - 4a & \text{for } 1 < |x| \le 2 \\ 0 & \text{otherwise} \end{cases}$$
If $(x_i, y_j)$, with $i, j = 0, 1, 2, 3$, denote the 16 data points around the pixel to be interpolated and $a = -0.5$ in Formula (1), the interpolation is computed as shown in Formula (2).
$$f(x, y) = \sum_{i=0}^{3} \sum_{j=0}^{3} f(x_i, y_j)\, BF(x - x_i)\, BF(y - y_j)$$
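The kernel in Formula (1) and the weighted sum in Formula (2) are straightforward to implement directly. The following is a minimal NumPy sketch (not the authors' code) of the basis function with a = −0.5 and of the interpolation of a single pixel from its 4 × 4 neighbourhood; boundary handling is omitted for brevity.

```python
import numpy as np

def bf(x, a=-0.5):
    """Interpolation basis function BF(x) from Formula (1)."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    elif x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0

def bicubic_pixel(img, x, y):
    """Interpolate the grayscale value at the non-integer position (x, y)
    from the surrounding 4 x 4 grid of known pixels (Formula (2))."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    value = 0.0
    for i in range(-1, 3):           # the 16 neighbours x0-1 .. x0+2
        for j in range(-1, 3):
            xi, yj = x0 + i, y0 + j
            value += img[yj, xi] * bf(x - xi) * bf(y - yj)
    return value

# usage: sampling at a fractional position inside a small test image
img = np.arange(64, dtype=float).reshape(8, 8)
print(bicubic_pixel(img, 3.4, 4.7))
```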
The advantage of the bicubic interpolation algorithm is that it corrects the shortcomings of bilinear interpolation: it takes into account both the values of neighboring points and their trends, producing smoother edges and improving computational accuracy. However, interpolation alone cannot add genuine detail, and when applied to large amounts of data it can leave the enlarged images looking blurred. Based on an analysis of convolutional networks, a super-resolution reconstruction technique built on convolutional neural networks was therefore constructed and applied to remote sensing images, achieving good results.

2.2. Improved SRCNN Algorithm

In 2014, Mnih, V. [21] proposed a visual-attention algorithm and showed, in comparison with similar methods, that it achieved a significant improvement in object detection. A current focus of research in this field is the super-resolution reconstruction of remote sensing images: the question is whether the detailed information in remote sensing images can be fully utilized during data processing while the learned details are amplified. With respect to how neural networks respond to high-frequency features, this paper aims to obtain effective reconstruction results by magnifying specific details of the original remote sensing images, yielding more useful imagery. To this end, we adopt the SRCNN algorithm and introduce attention mechanisms into the network to strengthen its ability to learn detailed information in remote sensing images, thereby improving the accuracy of the subsequent object detection network.
The reason for the birth of the Spatial Transformer Network (STN) [22] is that, in convolutional neural network models, simple translation operations in pooling layers can cause significant differences in the final output of the model, resulting in losses. The STN model aims to output correct results even after the feature space has undergone transformations, thereby achieving more robust spatial invariance based on spatial transformations.
The Squeeze-and-Excitation Network (SENet) [23] is a classic algorithm that focuses on the channel dimension. Its main idea is that, in CNNs, feature maps contain both spatial and channel features. From the spatial perspective, convolution operations are used during forward propagation to obtain the final feature map from the original image. These convolutions only process local features in the image, and, as forward propagation continues, the receptive field of deep layers becomes much larger than that of shallow layers; consequently, much spatial information cannot easily be captured by convolution operations alone in deep networks. For channel features, precise feature maps can be obtained by combining the spatial features produced by convolution along the channel dimension.
In neural networks, convolutional kernels of different sizes can obtain corresponding feature maps, and by combining these results the final output map of the network can be obtained. Among the two types of attention mechanisms mentioned above, SENet assigns different weight values to feature maps, giving them varying degrees of attention. In contrast, STN does not use different weight values, meaning that the overall weight of each feature map is the same. However, there are differences in the weights within individual feature maps. After exploring the characteristics of these two mechanisms, this study combined them into a super-resolution reconstruction technique to learn the feature information in images.
The Convolutional Block Attention Module (CBAM) [24] is based on the two aforementioned mechanisms, and its advantage is that it is very lightweight and can be embedded in almost any CNN. It mainly combines the features of spaces and channels by performing element-wise multiplication on both, thereby learning their respective focuses. While ensuring minimal computations and parameters, it can improve model performance, making it suitable for any convolutional neural network and enabling comprehensive training accordingly. Figure 3 shows the structural diagram of the CBAM, and Formula (3) presents its calculation formula.
$$H' = M_c(H) \times H; \qquad H'' = M_s(H') \times H'$$
In Formula (3), $H \in \mathbb{R}^{C \times H \times W}$ is the input, $M_c \in \mathbb{R}^{C \times 1 \times 1}$ is the attention map obtained through channel attention, $H' \in \mathbb{R}^{C \times H \times W}$ is the feature map obtained after the channel attention has been applied and broadcast, $M_s \in \mathbb{R}^{1 \times H \times W}$ is the attention map obtained from the spatial attention module, and $H'' \in \mathbb{R}^{C \times H \times W}$ is the output of the entire module. The multiplications denote the corresponding broadcasting operations. Figure 4 shows the flowchart of the channel attention mechanism, and Formula (4) gives the specific algorithm for the channel attention $M_c(H)$.
$$M_c(H) = \mathrm{Sigmoid}\big(\mathrm{MLP}(\mathrm{Avg}(H)) + \mathrm{MLP}(\mathrm{Max}(H))\big) = \mathrm{Sigmoid}\big(\mathrm{MLP}(H^{c}_{avg}) + \mathrm{MLP}(H^{c}_{max})\big)$$
H is fed into two pooling layers, which reduce its spatial dimensions and produce the features $H^{c}_{avg}$ and $H^{c}_{max}$, respectively. A shared network composed of a multi-layer perceptron is then established for these two features: $H^{c}_{avg}$ and $H^{c}_{max}$ are each passed through the shared network, compressed and then restored, and the two outputs are summed element-wise. Finally, the sigmoid activation function is used to weight each channel of the feature map and generate H'.
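As an illustration, a minimal Keras sketch of the channel attention branch described by Formula (4) is given below. It is not the authors' implementation; in particular, the reduction ratio of the shared MLP is an assumed value, since the paper does not specify one.

```python
from tensorflow.keras import layers

def channel_attention(h, reduction=8):
    # Channel attention M_c(H) from Formula (4): average- and max-pooled
    # descriptors pass through a shared two-layer MLP, are summed, and a
    # sigmoid produces one weight per channel. reduction=8 is an assumption.
    channels = int(h.shape[-1])
    hidden = max(channels // reduction, 1)
    dense1 = layers.Dense(hidden, activation='relu')   # shared MLP, layer 1
    dense2 = layers.Dense(channels)                    # shared MLP, layer 2

    avg = dense2(dense1(layers.GlobalAveragePooling2D()(h)))   # H^c_avg path
    mx = dense2(dense1(layers.GlobalMaxPooling2D()(h)))        # H^c_max path

    scale = layers.Activation('sigmoid')(layers.Add()([avg, mx]))
    scale = layers.Reshape((1, 1, channels))(scale)    # broadcast over H x W
    return layers.Multiply()([h, scale])               # H' = M_c(H) x H
```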
Our specific spatial attention mechanism's flowchart is shown in Figure 5. Formula (5) gives the specific process of the $M_s(H')$ operation.
$$M_s(H') = \mathrm{Sigmoid}\big(f^{7 \times 7}([\mathrm{Avg}(H'); \mathrm{Max}(H')])\big) = \mathrm{Sigmoid}\big(f^{7 \times 7}([H^{s}_{avg}; H^{s}_{max}])\big)$$
H' is used as the input of the spatial attention mechanism, and two pooling operations produce the channel-wise mean $H^{s}_{avg}$ and maximum $H^{s}_{max}$ of the feature map H'. These two maps are connected with a Concat layer to generate features in $\mathbb{R}^{2 \times H \times W}$. A 7 × 7 convolutional layer $f^{7 \times 7}$ then operates on the concatenated features, generating a spatial attention map in $\mathbb{R}^{1 \times H \times W}$. Finally, the sigmoid activation function determines the weights of the feature map.
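A corresponding Keras sketch of the spatial attention branch in Formula (5) follows; again, this is an illustration rather than the authors' code, and it assumes channels-last tensors.

```python
import tensorflow as tf
from tensorflow.keras import layers

def spatial_attention(h_prime):
    # Spatial attention M_s(H') from Formula (5): the channel-wise mean and
    # maximum maps are concatenated and passed through a 7x7 convolution
    # followed by a sigmoid, giving one weight per spatial position.
    avg = layers.Lambda(lambda t: tf.reduce_mean(t, axis=-1, keepdims=True))(h_prime)  # H^s_avg
    mx = layers.Lambda(lambda t: tf.reduce_max(t, axis=-1, keepdims=True))(h_prime)    # H^s_max
    concat = layers.Concatenate(axis=-1)([avg, mx])           # features in R^(2 x H x W)
    attn = layers.Conv2D(1, kernel_size=7, padding='same',
                         activation='sigmoid')(concat)        # map in R^(1 x H x W)
    return layers.Multiply()([h_prime, attn])                 # H'' = M_s(H') x H'
```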
Finally, the obtained spatial attention feature map M s ( H ) is multiplied and broadcast with the input feature map H’ to obtain the feature map H”. Then, two fully connected layers are used to reconstruct the feature map and obtain the predicted image. Based on the SRCNN algorithm, two attention mechanisms are first used to assign different weights to the feature maps. Then, three convolutional layers are used to achieve a mapping from low resolution to high resolution, improving the overall resolution of the remote sensing image. The structure of the network model used in this study is shown in Figure 6.
First, the preprocessed remote sensing image is input and the channel attention mechanism assigns different weights to its channels. Then, the spatial attention mechanism raises the weights of the high-frequency detail regions in space. The resulting feature map is fed into the SRCNN, which operates in three stages: the first convolutional layer extracts features from the input feature map with 64 kernels of size 9 × 9; the second performs non-linear mapping with 32 kernels of size 1 × 1; and the last layer reconstructs the mapped features into an image of improved resolution with one kernel of size 5 × 5. In Figure 6, c corresponds to the Y channel of the YCrCb representation, so c = 1. The stride of all three convolutions is 1, and only the first two layers use the ReLU activation function.
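Putting the pieces together, the following Keras sketch assembles an attention-augmented SRCNN with the layer sizes stated above (9 × 9 × 64, 1 × 1 × 32, 5 × 5 × 1, stride 1, ReLU on the first two layers, c = 1). It reuses the channel_attention and spatial_attention helpers sketched above; the use of 'same' padding is an assumption made to keep the input and output sizes equal, and the sketch is not the authors' implementation.

```python
from tensorflow.keras import layers, Model

def build_sr_network():
    # CBAM-style attention on the bicubic-upscaled Y channel, followed by the
    # three SRCNN convolutions described in the text.
    inp = layers.Input(shape=(None, None, 1))                       # c = 1 (Y channel of YCrCb)
    x = channel_attention(inp)
    x = spatial_attention(x)
    x = layers.Conv2D(64, 9, padding='same', activation='relu')(x)  # feature extraction, 9x9x64
    x = layers.Conv2D(32, 1, padding='same', activation='relu')(x)  # non-linear mapping, 1x1x32
    out = layers.Conv2D(1, 5, padding='same')(x)                    # reconstruction, 5x5x1
    return Model(inp, out)
```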
The Faster R-CNN algorithm is an end-to-end pipeline that completes feature extraction, candidate region generation, and target classification and regression within a single network, as shown in Figure 7.
The basic steps of this algorithm fall into three stages. First, convolution and pooling operations are stacked in the feature extraction network until the feature information in the image is extracted and a corresponding feature map is obtained. Second, instead of using selective search, the feature maps are fed into the RPN, and a large number of candidate boxes are generated using the network's anchor box mechanism. Third, the pooling layer in the classification and regression network unifies the size of each candidate region; during this stage the category of the target is identified and the regression layer fine-tunes its position.
In the original Faster R-CNN, only the feature information of small targets is learned in the last layer of the feature extraction network. When the convolution and pooling operations reach the last layer, the semantic information of the small targets almost disappears. Due to the presence of many small targets in the studied remote sensing image data, such as airplanes, vehicles, ships, etc., it is difficult to carry out effective target detection. This paper improves the feature extraction network and the subsequent candidate region generation network of the Faster R-CNN algorithm. The following is a detailed introduction to both.

3. Feature Extraction Network for Small Targets

Based on the core idea of enhancing the attention the algorithm pays to the receptive field in its lower layers, this paper improves the feature extraction network of the Faster R-CNN algorithm, resulting in the Sub-convolutional Attention Feature Pyramid Network (SA-FPN). This network not only optimizes the convolutional layers of the residual network but also establishes a feature-balanced pyramid structure that is integrated with an attention mechanism. This design strengthens its feature extraction capability specifically for the small targets present in remote sensing images, enabling the network to continuously learn small target information during training, thus achieving a better detection performance. The residual network not only prevents network degradation but also uses skip connections to ensure that lower-layer information is not overshadowed by higher-level abstract features during feature transformation.
(1) Improvement of the residual network
The first convolutional layer, which originally has a 7 × 7 kernel and a stride of 2, is decomposed into three layers with 3 × 3 kernels while the overall stride is kept unchanged. This not only retains the information learned by the original convolutional layer but also deepens the network hierarchy, allowing the residual network to extract feature information better. The second to fifth layers all use a large number of residual modules to perform convolution; each residual module consists of a 3 × 3 convolutional layer and two 1 × 1 convolutional layers, and the number of residual modules in each stage is 3, 4, 6, and 3, respectively. The initial residual module of each stage performs a downsampling operation. In the standard design, both branches downsample with a 1 × 1 convolution of stride 2; in the improved design, the downsampling of the residual branch is carried out by the 3 × 3 convolution, while the shortcut branch downsamples with an average pooling layer. This avoids ignoring features in the image and enables the residual network to fully learn the feature information [25]. The modules before and after this improvement are compared in Figure 8.
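A minimal Keras sketch of these two modifications is shown below: the stacked 3 × 3 stem and a downsampling residual module whose shortcut branch uses average pooling. Filter counts and the placement of the stride are illustrative assumptions, and batch normalization is omitted for brevity; this is not the authors' exact implementation.

```python
from tensorflow.keras import layers

def stem(x):
    # Improved Conv1: the original 7x7 stride-2 convolution is replaced by three
    # stacked 3x3 convolutions; the overall stride of 2 is kept (placed on the
    # first 3x3 here).
    x = layers.Conv2D(32, 3, strides=2, padding='same', activation='relu')(x)
    x = layers.Conv2D(32, 3, padding='same', activation='relu')(x)
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
    return x

def downsample_block(x, filters):
    # Downsampling residual module: the residual branch moves the stride-2
    # downsampling into the 3x3 convolution, while the shortcut branch uses
    # average pooling followed by a 1x1 convolution.
    shortcut = layers.AveragePooling2D(pool_size=2, padding='same')(x)
    shortcut = layers.Conv2D(filters, 1)(shortcut)

    y = layers.Conv2D(filters // 4, 1, activation='relu')(x)
    y = layers.Conv2D(filters // 4, 3, strides=2, padding='same', activation='relu')(y)
    y = layers.Conv2D(filters, 1)(y)
    return layers.Activation('relu')(layers.Add()([y, shortcut]))
```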
(2) Feature-Balanced Pyramid Structure
To address the presence of extremely small targets, often just a few pixels in size, within the dataset, this model establishes direct pathways between feature maps of different granularities, further enhancing the completeness of information extraction. Of the feasible multi-scale feature fusion techniques that could be used, the model tests and incorporates a feature-balanced pyramid structure.
The feature pyramid structure [26] laterally connects the feature maps produced at each stage of feature extraction and uses upsampling operations to build a pathway that first runs from the bottom to the top and then from the top to the bottom. This structure avoids excessive computational complexity and can be applied directly in object detection algorithms. By using convolutions to bring the feature maps to the same number of channels, the computational cost stays low and the top-level and low-level features can be fused with each other: the low-level layers provide additional information that helps the top-level layers recognize targets, so the small targets in remote sensing images are learned more fully. The right side of Figure 9 illustrates a feature extraction network that incorporates the feature pyramid. The ablation experiments show that this brings an accuracy improvement of nearly 4%.
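The lateral connections and top-down fusion can be sketched in Keras as follows. This is an illustration, not the authors' code; it assumes four backbone stage outputs C2 to C5 whose spatial sizes differ by exact factors of 2, and uses 256 pyramid channels, a common default.

```python
from tensorflow.keras import layers

def build_pyramid(c2, c3, c4, c5, channels=256):
    # Balanced pyramid sketch: each stage output is projected to the same channel
    # count by a 1x1 convolution, the top-down path upsamples by 2 and adds the
    # lateral map, and a 3x3 convolution produces the per-level prediction maps.
    l2, l3, l4, l5 = [layers.Conv2D(channels, 1)(c) for c in (c2, c3, c4, c5)]
    p5 = l5
    p4 = layers.Add()([layers.UpSampling2D(2)(p5), l4])
    p3 = layers.Add()([layers.UpSampling2D(2)(p4), l3])
    p2 = layers.Add()([layers.UpSampling2D(2)(p3), l2])
    return [layers.Conv2D(channels, 3, padding='same')(p) for p in (p2, p3, p4, p5)]
```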
(3) Integrating attention mechanisms
To reflect the effectiveness of the improvement strategy itself and avoid interference from the backbone, we chose the basic backbone ResNet. On the basis of improving the residual network, this study includes a channel attention mechanism, which assigns different levels of attention to feature maps, effectively differentiating information and allowing the feature maps to fully learn feature information. This approach focuses attention on the extraction of effective information. Additionally, by utilizing the spatial attention mechanism to focus on different regions of the feature map, it is possible to selectively filter out different weights of important information. The left side of Figure 9 shows the framework obtained by integrating the ResNet50 residual network with the CBAM module.
Finally, the residual network was modified into a convolutional form and a feature-balanced pyramid structure was constructed. Attention modules were added to the end of each convolutional layer, resulting in the final feature extraction network used in this study being a multi-scale attention pyramid structure. Figure 9 illustrates the specific structure of the network.
Unlike the original feature extraction network, the Conv1 layer of the ResNet50 backbone is first rebuilt as a stack of smaller convolutions; this modification retains the information learned by the original convolutional layer while deepening the network hierarchy and enhancing its extraction ability. Next, the Conv2 to Conv5 layers are constructed as a dual-path feature extraction structure, using convolutional and pooling layers for downsampling so as to preserve the image's pixel information as far as possible. An attention module is added at the end of each layer to fuse the output features in both the spatial and channel dimensions, ensuring that the features present in the image are extracted. Finally, using the multi-scale structure, the progressively smaller feature maps obtained during the bottom-up pass are laterally connected and then upsampled by a factor of 2 at each step of the top-down pass to restore their sizes; the prediction maps obtained from the convolutional layers are then fused to obtain accurate predicted feature maps.
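For orientation, the sketch below shows how these pieces compose into the SA-FPN forward pass, reusing the stem, downsample_block, channel_attention, spatial_attention, and build_pyramid helpers sketched earlier. Only one residual module per stage is shown, whereas the paper uses 3, 4, 6, and 3; this is a simplified illustration, not the exact network.

```python
def sa_fpn_features(image):
    # SA-FPN sketch: improved stem, four residual stages each followed by a
    # CBAM block (channel then spatial attention), then the balanced pyramid.
    x = stem(image)
    stage_outputs = []
    for filters in (256, 512, 1024, 2048):            # Conv2 .. Conv5
        x = downsample_block(x, filters)
        x = spatial_attention(channel_attention(x))   # CBAM at the end of each stage
        stage_outputs.append(x)
    return build_pyramid(*stage_outputs)              # P2 .. P5
```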

4. Candidate Region Generation Network

To enhance the model’s object detection performance, optimizing both its anchor point selection and region alignment processes is crucial. This study first addresses improvements in anchor point selection within the Region Proposal Network (RPN). By reintroducing randomly discarded anchors during training, the model better preserves and learns the features of small targets, which enhances its overall detection accuracy. Subsequently, the focus shifts to the alignment of candidate regions. This study proposes replacing the RoI Pooling layer with the RoI Align layer to mitigate quantization errors, thereby ensuring more precise feature extraction from candidate regions. Together, these enhancements—improved anchor selection and refined region alignment—contribute to the development of a more accurate and robust object detection network.
(1) Random anchor point selection
After optimizing the feature extraction network, corresponding improvements were made to the RPN mechanism. The RPN replaces the time-consuming and labor-intensive selective search by generating anchor boxes of different scales and aspect ratios at each location to obtain candidate regions. In this process, the selection of anchors is particularly important: during training, tens of thousands of anchors are generated, but ultimately only the top 2000 with the highest foreground confidence are kept, which creates competition among anchors.
Due to the large number of small target objects in these images, there are also many anchor points associated with them. During the filtering process, these small target anchors can be replaced by those of larger targets, causing an imbalance and making subsequent target classification difficult. To address this issue, this study suggests that by implementing a supplementary selection mechanism for anchor points, the features of small targets can be better preserved and learned. This approach enables the subsequent classification and regression networks to achieve more accurate discrimination and fine-tuning.
In the original candidate region generation network, the selection of anchors involves three main steps:
1. Sorting Anchors by Foreground Confidence: After all anchors and anchor boxes in a feature map are generated, they are sorted according to the foreground confidence calculated by the network. Based on the criterion that a higher confidence indicates a higher likelihood of the anchor box containing a complete target, anchors with a higher confidence are selected.
2. Non-Maximum Suppression (NMS): The selected anchor points then undergo non-maximum suppression. The specific operation of NMS involves deleting prediction boxes with excessive intersection-over-union (IoU) ratios. If the IoU between two prediction boxes is too large, indicating that they are close to each other, the redundant prediction boxes are removed to avoid redundancy in the trained network.
3. Distinguishing Positive and Negative Samples: The remaining anchor points are used to distinguish between positive and negative samples based on the IoU ratio between the predicted box and the true box, typically using a ratio of 1:3. After this series of operations, only positive samples, selected layer by layer, are used for network training.
During this process, most of the small target information is discarded, which is very unfavorable for the detection and recognition of small targets. To address this issue, at each of these three steps this study randomly selects some of the discarded anchor points and reintroduces them into the subsequent screening process. By continuously returning eliminated candidate boxes to the network during training, the model can learn from a wider range of feature information, making small targets more likely to be extracted and thereby improving the model's accuracy.
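A minimal NumPy sketch of this re-selection step is given below. The fraction of discarded anchors that is drawn back into the candidate set is an assumed value, as the paper does not state one, and the sketch is an illustration of the idea rather than the exact training code.

```python
import numpy as np

def select_proposals(scores, boxes, top_k=2000, resample_frac=0.1, rng=None):
    # Keep the top_k anchors by foreground score, then randomly revive a small
    # fraction of the discarded anchors before NMS, so that low-scoring
    # small-target anchors still have a chance to reach training.
    if rng is None:
        rng = np.random.default_rng()
    order = np.argsort(-scores)
    kept, dropped = order[:top_k], order[top_k:]

    n_back = int(resample_frac * top_k)
    if dropped.size and n_back:
        revived = rng.choice(dropped, size=min(n_back, dropped.size), replace=False)
        kept = np.concatenate([kept, revived])

    return boxes[kept], scores[kept]   # NMS and positive/negative sampling follow as usual
```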
(2) Align regions of interest
After passing through the candidate region network, the feature map and candidate regions must be brought to the fixed size required by the subsequent classification and regression networks. An important component of the Fast R-CNN algorithm proposed by Girshick et al. is therefore the RoI Pooling layer, which satisfies the input requirements of the subsequent fully connected layers by producing uniformly sized feature maps. Because the present algorithm adds a candidate region generation network, the number of RoIs increases significantly. Since the fully connected layers of the classification and regression networks require fixed-size inputs, all generated feature maps must have the same dimensions; however, the positions of candidate boxes are usually obtained from regression and involve decimals, so the RoI Pooling layer has to perform a quantization. The resulting quantization error is illustrated in Figure 10.
As shown in Figure 10, the original image has a side length of 800. The overall scaling stride of the backbone in the feature extraction network is 32, so the side length divides exactly, giving 800/32 = 25. The candidate box, however, has a length of 300; dividing 300 by 32 gives 9.375, a floating-point number, and truncating the decimal yields 9, so quantization occurs at the boundary. The region is then divided into 7 × 7 windows, making the window edge length 9/7 ≈ 1.28, another floating-point number, which is again quantized to 1. Clearly these operations introduce significant errors: a deviation of 0.1 on the feature map corresponds to 3.2 pixels when mapped back to the original image, and the error produced by the RoI in the figure exceeds 20 pixels, an impact that cannot be underestimated.
This study uses an RoI Align layer instead of the RoI Pooling layer. RoI Align replaces the simple rounding operations with bilinear interpolation, preserving floating-point coordinates and pixel values as far as possible. RoI Align does not only correct the boundary coordinates; it also carries out the pooling itself as one complete process. First, the floating-point coordinates of each candidate region are kept without rounding. Next, when the candidate region is divided into 7 × 7 windows, no rounding is performed either; instead, sampling positions are computed inside each window, bilinear interpolation over the surrounding integer coordinates gives the values at those positions, and finally the maximum value is taken.
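The following NumPy sketch illustrates the idea: coordinates stay as floats and each output bin is filled by bilinear interpolation (one sample per bin centre for brevity, whereas the layer described above takes the maximum over samples). It is an illustration of the mechanism rather than the exact layer used in the experiments.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    # Bilinear interpolation at a floating-point location (y, x) of a 2-D
    # feature map -- the core operation RoI Align uses instead of rounding.
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, feat.shape[0] - 1), min(x0 + 1, feat.shape[1] - 1)
    wy, wx = y - y0, x - x0
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bot = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bot * wy

def roi_align(feat, roi, out_size=7):
    # RoI Align for one RoI given as (x1, y1, x2, y2) in feature-map
    # coordinates, kept as floats throughout.
    x1, y1, x2, y2 = roi
    bin_h, bin_w = (y2 - y1) / out_size, (x2 - x1) / out_size
    out = np.zeros((out_size, out_size), dtype=feat.dtype)
    for i in range(out_size):
        for j in range(out_size):
            cy = y1 + (i + 0.5) * bin_h        # bin centre, no rounding
            cx = x1 + (j + 0.5) * bin_w
            out[i, j] = bilinear_sample(feat, cy, cx)
    return out
```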
At this point, the complete optimization of the Faster R-CNN algorithm has been outlined. Firstly, in the feature extraction network, a feature pyramid structure was incorporated to address the dense arrangement of small targets in the dataset. Additionally, the backbone network ResNet50 was deepened to solve existing problems. Attention modules were added to the end of each level, combining channel attention with spatial attention. By assigning weight values to different features, the network focuses more on the target information in the image, forming the new SA-FPN feature extraction network.
In the candidate region generation network, the deletion operation of anchor boxes was randomly supplemented, allowing for the continuous input of potentially missing target information into the network. Furthermore, when the subsequent pooling layer is unified to a fixed size, the scale of the feature maps and candidate boxes is no longer quantized, preserving floating-point features. This improves the accuracy of prediction boxes for subsequent targets, enhancing the overall detection accuracy of the network and forming the new RA-RPN network.

5. Performance Experiments and Their Evaluations

This study was conducted and tested on Windows 10, with CUDA 10.1 and CUDNN 7.4.1 installed for GPU acceleration. For the experiment related to remote sensing technology, LabelImg software was used to label the targets in images. The programming language of the deep learning algorithm was Python 3.6, which was managed using Anaconda. The entire algorithm was implemented using two deep learning frameworks, TensorFlow and Keras.
The algorithm model in this study is trained end-to-end based on the Faster R-CNN algorithm. The training procedure includes four parts: transfer learning, in which the ResNet50 network pretrained on the VOC dataset is used as the backbone; the initialization of convolutional layers with a Gaussian distribution of mean 0 and standard deviation 0.01; parameter tuning, in which, after transfer learning, the network parameters are fine-tuned on the custom training dataset built in this study; and model tuning with the ReLU activation function. The primary advantage of ReLU is that it simply outputs the maximum of its input and zero, which is computationally cheaper than the sigmoid and tanh functions; those functions involve exponential operations that increase training time and can saturate, causing vanishing gradients, which is particularly problematic in two-stage models where detection time is already a bottleneck. The improvements in this study therefore introduce no significant additional computational load into the Faster R-CNN algorithm. Given that the experimental dataset is small, the loss value during training is visualized to help judge the model parameters and prevent overfitting. As shown in Figure 11, the loss stabilizes after approximately 40,000 iterations, so the model is trained for 40,000 iterations with a batch size of 256 over 10 epochs. Because of the inherent randomness and complex loss functions in deep learning training, the model is saved every 1000 iterations, and after 30,000 iterations each saved model is tested to determine the optimal detection model. The learning rate is set to 0.001 with a momentum coefficient of 0.9. As in the algorithm described in Section 3, the loss function is optimized by gradient descent with a weight decay coefficient of 0.0001.
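For reference, the hyper-parameters above translate into the following Keras configuration sketch; it shows only the optimizer, initializer, and weight-decay settings, not the full Faster R-CNN training loop, and the mapping of weight decay to an L2 penalty is an assumption.

```python
import tensorflow as tf

# Optimizer and initialization corresponding to the reported hyper-parameters.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)

conv_init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.01)   # Gaussian init
conv_reg = tf.keras.regularizers.l2(1e-4)                               # weight decay 0.0001
# e.g. layers.Conv2D(..., kernel_initializer=conv_init, kernel_regularizer=conv_reg)
```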
Dataset construction: Data are a crucial part of deep learning, and the quantity and quality of a dataset directly affect experimental results. This study selected ships, airplanes, and vehicles as the target categories. The dataset consists of visible-light remote sensing images from high-resolution satellite sources such as GF-1 and GF-2, covering the whole sea area of Liaoning, together with DOTA data. The entire dataset is in VOC format and contains 1500 original images; after data augmentation such as flipping, cropping, and rotation, a total of 7500 images were obtained. The dataset is divided into training, validation, and testing sets in a 7:2:1 ratio.
Data augmentation includes flipping, rotating, cropping, and the super-resolution reconstruction of images with insufficient resolution. The annotation files are in XML format and correspond one-to-one with the images. Each annotation file defines the ground-truth boxes, with $(x_{min}, y_{min})$ denoting the upper left corner and $(x_{max}, y_{max})$ the lower right corner, and also records the target category (aircraft, ship, car); the rectangular region containing the target is thus determined by these two corner coordinates. The annotation effect is shown in Figure 12.
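A small helper for reading such VOC-style XML files might look as follows; this is a sketch, with element names following the standard VOC layout described above.

```python
import xml.etree.ElementTree as ET

def parse_voc_annotation(xml_path):
    # Return a list of (class_name, xmin, ymin, xmax, ymax) tuples from one
    # VOC-style annotation file (class names: aircraft, ship, car).
    root = ET.parse(xml_path).getroot()
    targets = []
    for obj in root.findall('object'):
        name = obj.find('name').text
        box = obj.find('bndbox')
        targets.append((
            name,
            int(float(box.find('xmin').text)),
            int(float(box.find('ymin').text)),
            int(float(box.find('xmax').text)),
            int(float(box.find('ymax').text)),
        ))
    return targets
```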
To evaluate our proposed solution, this paper uses commonly used metrics: Precision, Recall, Average Precision (AP), and Mean Average Precision (mAP).
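For completeness, a short sketch of the AP and mAP computation is given below; it uses the all-point interpolation of the precision-recall curve, which is an assumption since the paper does not state the exact AP variant it follows.

```python
import numpy as np

def average_precision(recall, precision):
    # VOC-style AP: area under the precision-recall curve after making the
    # precision envelope monotonically decreasing. Assumes recall is sorted
    # in ascending order.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(p.size - 2, -1, -1):       # precision envelope
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    # mAP is simply the mean of the per-class AP values.
    return float(np.mean(list(ap_per_class.values())))
```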
Table 1 presents the performance of the object detection algorithm under different configurations. The first row represents the baseline accuracy of the original Faster R-CNN algorithm, showing its detection accuracy for three target classes, airplanes, ships, and cars, without any modifications. The combination of super-resolution processing with ResNet50 results in a slight increase in mAP (from 61.6% to 62.2%), indicating that while super-resolution techniques have a limited effect on detection performance, their integration with an improved ResNet50 does provide some benefit. However, when anchor re-selection and the CBAM attention module are individually introduced, there is a significant improvement in mAP, which reaches 65.2% and 66.1%, respectively. This suggests that the CBAM effectively enhances the detection accuracy of small targets, particularly in the detection of ships and cars.
The introduction of the Feature Pyramid Network (FPN) further boosts the model’s detection accuracy, particularly for ships and cars, although it results in a slight decrease in airplane detection accuracy. This could be due to the FPN’s emphasis on improving the recognition of smaller targets, while airplanes, being relatively larger, are less affected by these enhancements. Finally, the combined use of anchor re-selection and RoI Align further optimizes the position and size of prediction boxes. While the increase in mAP is relatively modest, the overall accuracy is improved, reaching a final mAP of 71.2%. This outcome demonstrates that improvements to the candidate region generation network, such as eliminating quantization errors and reintroducing some discarded anchors, lead to more precise prediction boxes and thus enhance overall detection performance. In summary, the cumulative effect of these optimization strategies results in nearly a 10% increase in overall detection accuracy compared to the baseline, highlighting the effectiveness of these enhancements in boosting the object detection algorithm’s performance.
To rigorously evaluate the proposed method, it was compared with several well-known lightweight object detection models. SSD and YOLOv2 are renowned for their speed and efficiency. Faster R-CNN, as a representative of two-stage models and the foundation of our implementation, is particularly valuable for assessing the effectiveness of our strategy. Mask R-CNN, built on Faster R-CNN and with additional segmentation capabilities, is used to evaluate the performance of our method in more complex scenarios.
Algorithm Comparison Experiment: In order to verify the advantages of the detection algorithm proposed in this paper, a comparative analysis was conducted between our algorithm and the SSD, YOLOv2, Faster R-CNN, and Mask R-CNN algorithms. The results, shown in Table 2, indicate that the overall experimental accuracy of our algorithm is relatively high, verifying the feasibility of the proposed algorithm.
Overall, the detection performance of our algorithm is superior to other networks, as it has an improved detection accuracy compared to the original algorithm and surpasses that of several commonly used models. Its detection accuracy for airplanes, ships, and cars reaches 87.5%, 63.6%, and 61.7%, respectively, with an average evaluation accuracy of 71.2%.
After conducting ablation experiments and comparative experiments with different algorithms, in order to better demonstrate the performance of our algorithm, its detection effect was subjectively evaluated. The image to the left in Figure 13 shows its detection results for ships, while the image on the right shows its detection results for cars.
Overall, the improved object detection algorithm model presented in this paper has enhanced the detection of small targets and reduced the incidence of missed detections. Through a series of experimental data comparisons and subjective evaluations, it has been verified that the algorithm proposed in this paper is feasible for remote sensing applications in sea areas containing a large number of small targets.

6. Conclusions

This study addresses the issue of low accuracy in detecting small targets in traditional remote sensing image detection algorithms by implementing several improvements to the Faster R-CNN model, including a residual network enhancement and random anchor point selection. Experiments conducted on a real-world dataset mixed with open-source data, along with ablation studies and comparisons with other algorithms, demonstrate that the proposed improvements lead to a 10% increase in overall average accuracy compared to the original Faster R-CNN algorithm. This confirms our algorithm’s excellent performance in remote sensing image detection in the Liaoning sea area. Compared to the popular Transformer series methods, the modules and super-resolution optimization techniques used in this study are lighter and more user-friendly.
However, this study has some limitations. First, due to the small size of the test dataset, the algorithm’s generalization capability needs to be further validated on larger datasets. Additionally, this study mainly focuses on improving the detection accuracy of small targets, and its performance in other complex scenarios has not yet been thoroughly explored. Future work could focus on the following areas: expanding the dataset’s size and diversity to enhance the algorithm’s robustness; integrating more advanced detection technologies into the algorithm, such as better data augmentation and detection heads, to further improve its detection performance; and exploring the algorithm’s potential applications in other remote sensing scenarios, such as urban monitoring and environmental surveillance.

Author Contributions

Conceptualization, J.L.; methodology, Y.R.; software, Y.R.; validation, Y.B.; investigation, J.L.; data curation, Y.R.; writing—original draft preparation, Y.R.; writing—review and editing, G.Y.; supervision, Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant numbers 62137001 and 62272093).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Han, J.; Cheng, G. A survey on object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. 2016, 117, 11–28. [Google Scholar]
  2. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1106–1114. [Google Scholar]
  3. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  4. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  5. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  6. Girshick, R.B. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 91–99. [Google Scholar]
  8. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  9. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving Into High Quality Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
  10. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI Transformer for Oriented Object Detection in Aerial Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 2849–2858. [Google Scholar]
  11. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  12. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  13. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  14. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  15. Zou, X.; Wu, C.; Liu, H.; Yu, Z.; Kuang, X. An accurate object detection of wood defects using an improved Faster R-CNN model. Wood Mater. Sci. Eng. 2024, 1–7. [Google Scholar] [CrossRef]
  16. Zhang, H.; Shao, F.; Chu, W.; Dai, J.; Li, X.; Zhang, X.; Gong, C. Faster R-CNN based on frame difference and spatiotemporal context for vehicle detection. Signal Image Video Process. 2024, 18, 7013–7027. [Google Scholar] [CrossRef]
  17. Bai, T.; Luo, J.; Zhou, S.; Lu, Y.; Wang, Y. Vehicle-Type Recognition Method for Images Based on Improved Faster R-CNN Model. Sensors 2024, 24, 2650. [Google Scholar] [CrossRef] [PubMed]
  18. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image Super-Resolution Using Deep Convolutional Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
  19. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
  20. Keys, R.G. Cubic convolution interpolation for digital image processing. IEEE Trans. Acoust. Speech Signal Process. 1981, 29, 1153–1160. [Google Scholar] [CrossRef]
  21. Mnih, V.; Heess, N.; Graves, A. Recurrent Models of Visual Attention. In Proceedings of the Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2204–2212. [Google Scholar]
  22. Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial Transformer Networks. In Proceedings of the Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 2017–2025. [Google Scholar]
  23. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  24. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  25. He, T.; Zhang, Z.; Zhang, H.; Zhang, Z.; Xie, J.; Li, M. Bag of Tricks for Image Classification with Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 558–567. [Google Scholar]
  26. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
Figure 1. Feature maps.
Figure 2. Bicubic interpolation.
Figure 3. CBAM structure diagram.
Figure 4. Flowchart of channel attention mechanism.
Figure 5. Flowchart of spatial attention mechanism.
Figure 6. Diagram of super-resolution reconstruction network’s structure.
Figure 7. A structure diagram of the Faster R-CNN algorithm.
Figure 8. Comparison diagram of residual network before and after its improvement.
Figure 9. The structure diagram of an SA-FPN feature extraction network.
Figure 10. Quantization error.
Figure 11. Loss trends during training.
Figure 12. The results of dataset annotation.
Figure 13. The results of detection.
Table 1. The ablation experiment (AP per class and mAP, %). The configuration columns indicate which of super-resolution, ResNet50 *, FPN, CBAM, anchor point re-selection, and RoI Align are enabled in each row.

Aircraft | Ships | Cars | mAP
80.8 | 50.0 | 51.1 | 61.6
81.0 | 51.2 | 52.1 | 62.2
88.7 | 56.7 | 52.3 | 65.2
88.0 | 56.9 | 55.3 | 66.1
87.7 | 57.7 | 56.2 | 67.2
87.8 | 57.9 | 57.0 | 68.8
87.5 | 63.6 | 61.7 | 71.2
Table 2. Per-class AP (%) and mAP of different algorithms.

Class | SSD | YOLOv2 | Faster R-CNN | Mask R-CNN | Ours
Airplane | 57.8 | 76.9 | 80.8 | 82.6 | 87.5
Boat | 24.7 | 52.4 | 50.0 | 58.2 | 63.6
Car | 36.9 | 38.7 | 51.1 | 53.3 | 61.7
mAP | 29.9 | 39.2 | 61.6 | 65.8 | 71.2
