2.3.1. Algorithm Framework
The target recognition network in StereoYOLO is based on an improved version of YOLOv5, a leading object detection algorithm. YOLO divides the input image into a grid, and each grid cell is responsible for detecting the targets whose centers fall within it. Owing to its efficiency and accuracy, YOLO has become one of the best-known families of object detection algorithms.
Since its release, YOLOv5 has been continuously refined to improve efficiency and accuracy. In this paper, we use the YOLOv5 v6.0 code implementation. By model size, the family comprises YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, which increase sequentially in depth, width, and parameter count. Our algorithm is based on an improved YOLOv5s detection model.
The target recognition network framework of the StereoYOLO algorithm is shown in Figure 4. The main structure consists of the Backbone, the Feature Pyramid Network (FPN), and the YOLO Head. In the figure, Resunit denotes the residual module, Conv the convolutional block, BN batch normalization, SiLU the activation function, Concat tensor concatenation, and MaxPool max pooling; Channel Attention and Spatial Attention denote the channel and spatial attention modules, respectively. Complex network structures are built by combining these basic modules. The algorithm replaces the CSP1_X module in the backbone network with the improved CSPCBAM and CSPCA modules, which carry attention mechanisms; their channel and spatial attention apply targeted weighting for maritime target recognition scenes, improving the detection accuracy of maritime target detection tasks.
2.3.2. Backbone Network
The algorithm uses the CSPDarknet backbone network to extract features from the input image. Features are tapped at three different depths of the backbone, yielding three feature layers, called effective feature layers.
At the input end, the backbone employs the Focus structure, which samples every other pixel of the image to form four complementary sub-images. Stacking these sub-images quadruples the channel count (e.g., from 3 input channels to 12) while halving the spatial resolution, concentrating width-height information into the channel dimension and providing richer channel depth than the usual three-channel input.
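As an illustration, the following is a minimal PyTorch sketch of the Focus slicing; the channel counts and kernel size are illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    # Slice every other pixel into four sub-images and stack them on the
    # channel axis: (B, C, H, W) -> (B, 4C, H/2, W/2), then fuse with a conv.
    def __init__(self, in_ch=3, out_ch=64, k=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch * 4, out_ch, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(),
        )

    def forward(self, x):
        return self.conv(torch.cat([
            x[..., ::2, ::2],    # even rows, even columns
            x[..., 1::2, ::2],   # odd rows, even columns
            x[..., ::2, 1::2],   # even rows, odd columns
            x[..., 1::2, 1::2],  # odd rows, odd columns
        ], dim=1))
```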
The backbone network makes extensive use of residual units (Resunit), in which the shortcut edge is left unprocessed and the input of the block is added directly to its output; this counteracts the vanishing gradient and network degradation problems that come with increased depth. In addition, CSPNet splits the stack of residual blocks: the main branch continues stacking the original residual blocks, while the other branch is connected almost directly to the end after minimal processing. The backbone also employs SiLU as the activation function, which is unbounded above, bounded below, smooth, and non-monotonic; as a smooth variant of ReLU, SiLU performs better than ReLU in deep models.
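A minimal PyTorch sketch of the residual unit and CSP split described above; the helper name conv_bn_silu and the channel arithmetic are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

def conv_bn_silu(c1, c2, k):
    # Conv block used throughout the backbone: Conv -> BN -> SiLU.
    return nn.Sequential(
        nn.Conv2d(c1, c2, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c2),
        nn.SiLU(),
    )

class Resunit(nn.Module):
    # Residual unit: the shortcut edge is unprocessed and is added
    # directly to the output of two convolutional blocks.
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(conv_bn_silu(c, c, 1), conv_bn_silu(c, c, 3))

    def forward(self, x):
        return x + self.block(x)

class CSP1_X(nn.Module):
    # CSP split: the main branch stacks n residual units; the other branch
    # is routed to the end with only a 1x1 conv, then both are concatenated.
    def __init__(self, c1, c2, n=1):
        super().__init__()
        c_ = c2 // 2
        self.main = nn.Sequential(conv_bn_silu(c1, c_, 1),
                                  *[Resunit(c_) for _ in range(n)])
        self.shortcut = conv_bn_silu(c1, c_, 1)
        self.fuse = conv_bn_silu(2 * c_, c2, 1)

    def forward(self, x):
        return self.fuse(torch.cat([self.main(x), self.shortcut(x)], dim=1))
```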
The final part of the backbone network is the spatial pyramid pooling (SPP) structure, which applies max pooling with several different kernel sizes in parallel and concatenates the results, enlarging the receptive field.
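A corresponding sketch of SPP, assuming the common 5/9/13 kernel sizes used in YOLOv5 (the paper does not list them):

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    # Max pooling at several kernel sizes over the same input; stride 1 and
    # matching padding keep the spatial size, so the pooled outputs can be
    # concatenated with the input to enlarge the receptive field.
    def __init__(self, c1, c2, kernels=(5, 9, 13)):
        super().__init__()
        c_ = c1 // 2
        self.cv1 = nn.Conv2d(c1, c_, 1, bias=False)
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernels
        )
        self.cv2 = nn.Conv2d(c_ * (len(kernels) + 1), c2, 1, bias=False)

    def forward(self, x):
        x = self.cv1(x)
        return self.cv2(torch.cat([x] + [p(x) for p in self.pools], dim=1))
```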
2.3.6. Attention Mechanism-Based Maritime Target Detection Algorithm
In maritime target recognition tasks, camera-captured targets are predominantly located near the sea–sky line, which calls for tailored attention mechanisms. In the feature tensors, these targets exhibit spatial regularities that are critical for recognition.
To capitalize on this, a spatial attention module is employed, assigning higher weights to the tensor regions where targets are more likely to appear, typically the lower half of the image in maritime scenarios. Concurrently, the characteristic backdrop of bluish sky and greenish sea influences the pixel values of captured images: in the red, green, and blue (RGB) channels, the blue and green values tend to be higher, whereas pixels containing maritime targets display varied channel values reflecting distinct target features. This variation underscores the need for channel-specific attention.
By introducing channel attention, the algorithm can differentially weigh the RGB channels based on their relevance to target features, enhancing detection accuracy. Such channel-specific adjustments, in synergy with spatial attention, forge a more nuanced and effective approach to maritime target recognition, optimizing the likelihood of accurate target identification.
Due to the specificity of maritime target recognition tasks, this paper uses attention mechanisms to improve the maritime target detection algorithm, aiming for better detection accuracy. The CBAM and CA attention modules are introduced separately into the backbone network to exploit how targets are distributed across space and channels in maritime scenes. The attention modules are inserted after the CSP feature extraction module to change the model's sensitivity to features from different regions. The improved CSPCBAM and CSPCA network structures are shown in Figure 5.
In the figure, Channel Attention represents channel attention, Spatial Attention represents spatial attention, X-Axis Avg Pool represents X-axis average pooling, Y-Axis Avg Pool represents Y-axis average pooling, X-Axis Attention represents X-axis attention, and Y-Axis Attention represents Y-axis attention.
CBAM combines channel attention and spatial attention mechanisms. Given an intermediate feature map, the module sequentially infers attention maps along the two independent dimensions of channels and space, and then multiplies the attention maps by the input feature map for adaptive feature refinement.
In the CBAM attention module, given an intermediate feature map $F \in \mathbb{R}^{C \times H \times W}$ as input, the module successively infers a 1D channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$ and a 2D spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$. The algorithm structure is represented by Equation (11):

$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F' \tag{11}$$

where $F$ is the output tensor of the CSP module, $F'$ is the weight tensor after channel attention processing, and $F''$ is the weight tensor after channel-spatial attention processing; $\otimes$ denotes element-wise multiplication (broadcast over the attended dimensions).
The Convolutional Block Attention Module (CBAM) sequentially applies two distinct attention mechanisms to the input tensor: first, the channel attention module assesses the importance of each feature channel, then the spatial attention module evaluates the significance of different spatial regions. The tensor is progressively refined through these mechanisms, with the final output reflecting the enhanced feature representation after attention has been applied.
The CBAM attention module thus accounts for both channel attention and spatial attention and reweights the corresponding feature values, which matches the needs of maritime target recognition tasks. Therefore, this paper introduces it into the feature extraction backbone network to improve maritime target recognition.
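A minimal PyTorch sketch of Equation (11) as realized in a standard CBAM block; the reduction ratio and kernel size follow the original CBAM paper's defaults and are assumptions here, not values from this paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    # M_c(F): squeeze spatial dims by avg- and max-pooling, pass both
    # results through a shared MLP, sum, and squash to (0, 1).
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        return torch.sigmoid(self.mlp(F.adaptive_avg_pool2d(x, 1)) +
                             self.mlp(F.adaptive_max_pool2d(x, 1)))

class SpatialAttention(nn.Module):
    # M_s(F'): pool across channels, then convolve to a per-pixel weight map.
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        pooled = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(pooled))

class CBAM(nn.Module):
    # Equation (11): F' = M_c(F) * F, then F'' = M_s(F') * F'.
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, f):
        f = self.ca(f) * f      # channel-refined F'
        return self.sa(f) * f   # spatially refined F''
```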
CA attention exploits the different probabilities of targets appearing at different positions along the image width and height to optimize the detection task. It encodes precise positional information along the two directions: given an input feature map $x$, global average pooling is performed along the $h$ and $w$ directions, respectively, to obtain a pair of direction-aware feature maps, as shown in Equation (12):

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i), \qquad z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) \tag{12}$$
The feature maps along the $h$ and $w$ directions are concatenated and sent into a 1 × 1 convolution block with shared weights, producing $F_1$. The feature map $F_1$ is then subjected to batch normalization and a Sigmoid activation function to obtain the feature map $f$, as shown in Equation (13):

$$f = \sigma\big(\mathrm{BN}(F_1)\big), \qquad F_1 = \mathrm{Conv}_{1 \times 1}\big([z^h, z^w]\big) \tag{13}$$
Following that, $f$ is split back into its height and width components, $f^h$ and $f^w$, and a 1 × 1 convolution is performed on each, yielding new feature maps $F^h$ and $F^w$. After applying the Sigmoid activation function, the attention maps for the $h$ and $w$ directions, $g^h$ and $g^w$, are obtained, as shown in Equation (14):

$$g^h = \sigma\big(\mathrm{Conv}_{1 \times 1}(f^h)\big), \qquad g^w = \sigma\big(\mathrm{Conv}_{1 \times 1}(f^w)\big) \tag{14}$$
Finally, the attention weights $g^h$ and $g^w$ are multiplied with the original feature map along the $h$ and $w$ directions, respectively, to obtain the attention-weighted feature map, as shown in Equation (15):

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) \tag{15}$$
The CA attention module primarily focuses on spatial attention and refines it into attention weights in the image height and width directions, generating attention-weighted feature maps. This provides a targeted improvement for the greater probability of targets appearing near the sea–sky line in maritime target recognition tasks.
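Equations (12)-(15) together amount to the following PyTorch sketch. The reduction ratio is an assumption; note also that the reference CA implementation applies a hard-swish after the batch norm, whereas the text above specifies Sigmoid, which is what this sketch follows.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    # Eq. (12)-(15): pool along H and W separately, encode the two strips
    # jointly with a shared 1x1 conv, split back, and reweight the input
    # with per-direction attention maps.
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, 1, bias=False)
        self.bn = nn.BatchNorm2d(mid)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        _, _, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                      # (B, C, H, 1), Eq. (12)
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (B, C, W, 1), Eq. (12)
        f = torch.sigmoid(self.bn(self.conv1(torch.cat([z_h, z_w], dim=2))))  # Eq. (13)
        f_h, f_w = torch.split(f, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(f_h))                      # (B, C, H, 1), Eq. (14)
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))  # (B, C, 1, W), Eq. (14)
        return x * g_h * g_w                                       # Eq. (15)
```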
2.3.7. Experimental Platform
Our research is developed on the ZED stereo camera, as shown in Figure 6, with a resolution of 1920 × 1080 per lens at 30 frames per second. The development platform runs the Windows 10 21H2 operating system with 64 GB RAM, an Intel Core i7-10700F CPU (base frequency 2.9 GHz), an NVIDIA RTX A4000 GPU, and Python 3.9.7.
The experimental platform uses a 1.8 m test boat as the target. By performing target detection, ranging, and positioning experiments on the platform, the accuracy of the maritime target recognition algorithm and ranging errors can be verified, thereby validating the target positioning algorithm.
To meet the real-time and offline computing requirements of maritime target recognition and ship motion state monitoring, the mobility and power constraints of the monitoring platform must be considered. It is therefore necessary to port the maritime target recognition algorithm to small embedded devices.
The embedded platform in this research adopts the Ubuntu 20.04 operating system, with an NVIDIA Jetson AGX Orin as the hardware platform [34]. It has 32 GB RAM, 5.32 TFLOPS of single-precision floating-point performance, and supports CUDA 11.3.
The detailed parameters of the development and embedded platforms are shown in Table 1.
The detection results are transmitted from the embedded platform to the lower-level machine via the CAN bus interface. The lower-level machine is also connected to an Inertial Measurement Unit (IMU), enabling it to compute two-dimensional horizontal positioning information. The transmitted data are formatted as extended frames consisting of the target sequence number, category, three-dimensional position (x, y, z), and velocity (x speed, y speed, z speed).
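The paper does not specify the byte layout, so the following python-can sketch is a hypothetical illustration of such an extended-frame transmission; the field scaling, two-frame split, and identifier packing are all assumptions (a classic CAN data field carries at most 8 bytes, so the six fields are spread over two frames):

```python
import struct
import can  # python-can

def send_target(bus: can.BusABC, seq: int, category: int, pos, vel):
    # Hypothetical layout: x, y, z and vx, vy, vz as signed 16-bit values
    # scaled to centimetres (range about +/-327 m), 12 bytes split across
    # two frames; seq, category, and the frame index ride in the 29-bit ID.
    fields = [int(v * 100) for v in (*pos, *vel)]
    payload = struct.pack("<6h", *fields)
    for i, chunk in enumerate((payload[:6], payload[6:])):
        arb_id = (seq & 0xFF) << 12 | (category & 0xF) << 8 | i
        bus.send(can.Message(arbitration_id=arb_id, data=chunk,
                             is_extended_id=True))

# Example: bus = can.interface.Bus(channel="can0", bustype="socketcan")
# send_target(bus, seq=1, category=2, pos=(12.3, -0.5, 48.0), vel=(1.2, 0.0, -0.4))
```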