4.2. Experimental Equipment and Evaluation Indicators
We used the Windows 11 operating system in our experiments. The CPU was an Intel Core i9-13900K, the GPU was an Nvidia GeForce RTX 4090 (24 GB), and the software environment consisted of PyTorch 2.0.0, CUDA 12.1, and cuDNN 8.9.0. To ensure fairness in the experiments, the hyperparameters of each group were set to be identical in the training phase. Their settings are shown in
Table 1.
The loss function curves of the proposed method are demonstrated in
Figure 8, which comprise three components: the localization loss, the distribution focal loss, and the classification loss. As shown by panels (a) and (b) in
Figure 8, the three loss functions of both EF-UODA and YOLOv8X converged within 100 training epochs.
As the proposed deep neural algorithm is end to end, we compared it with prevalent end-to-end deep neural algorithms in our experiments. As mentioned previously, we used the URPC dataset to evaluate our model and reported its AP50 (i.e., the mean average precision (mAP) at an IoU threshold of 0.5) under the COCO metric as the main evaluation indicator.
The mAP over objects in n categories was computed as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$
$$AP = \int_{0}^{1} P(R)\, dR, \qquad mAP = \frac{1}{n} \sum_{i=1}^{n} AP_{i},$$

where
TP (true positives) refers to the number of positive samples that are correctly detected,
FP (false positives) refers to the number of negative samples that are incorrectly detected as positive, and
FN (false negatives) refers to the number of positive samples that are missed.
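For concreteness, the core arithmetic of these metrics can be sketched in plain Python (a simplified, illustrative implementation using all-point interpolation; the function names are ours, not from the paper's code):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts.

    tp: correctly detected objects (true positives)
    fp: detections matching no ground-truth object (false positives)
    fn: ground-truth objects that were missed (false negatives)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def average_precision(recalls, precisions):
    """Area under the precision-recall curve, with the precision
    envelope made monotonically non-increasing before integrating."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_average_precision(ap_per_class):
    """mAP: the mean of the per-class average precisions."""
    return sum(ap_per_class) / len(ap_per_class)
```

A full COCO-style evaluator additionally sweeps confidence cutoffs and, for the [0.5:0.95] metric, averages over ten IoU thresholds; this sketch shows only the per-threshold arithmetic.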
4.3. Ablation and Comparison Experiments
The main goal of this paper is to develop an underwater object detection algorithm with higher accuracy while maintaining real-time detection; minimizing the algorithm's GFLOPs and parameter count is pursued once those conditions are met. Therefore, provided that real-time detection is satisfied, accuracy is the most important index for judging the performance of the algorithm.
To comprehensively assess the effectiveness of each scheme for EF-UODA, we conducted ablation experiments on the URPC dataset; the results are shown in
Table 2. Replacing CSPDarknet with the ViT-based backbone NexT effectively reduced the computational cost of the algorithm, made it focus more on contextual information, enhanced its generalization ability, and preserved detection accuracy. Although the extra prediction head P2 and M2F-FPN increased the GFLOPs of the algorithm, the resulting improvements in accuracy (1.5% and 1.4%, respectively) were significant enough that we consider the tradeoff worthwhile. From
Table 2, it can be clearly seen that MPDIoU effectively addressed the problem of vastly different object scales, which was particularly beneficial on the URPC dataset, where image scales change drastically.
We conducted ablation and comparison experiments to demonstrate the efficiency of the proposed feature extraction module C3-EMPC; the results are shown in
Table 3. Additionally, the results of the experiments are depicted in
Figure 9, facilitating the observation of data variations. Once the convolutional module of C3 was replaced with the EMPC module, the GigaFLOPs (GFLOPs) of the algorithm decreased from 142.3 to 140.0, the parameter count decreased from 34.81 M to 34.69 M, the frames per second (FPS) improved from 78.74 to 80.65, and the AP50 increased from 86.2% to 86.9%.
We integrated a few high-performing methods into the C3 and C2f modules for the comparison experiments: C3-CloAtt [
46], C3-ScConv [
47], C3-SCConv [
48], C2f-Faster [
49], C2f-DBB [
50], and C2f-ODConv [
51]. C3-EMPC was slightly slower than C2f and C2f-Faster, but its accuracy was almost 1% higher. The GFLOPs of C3-EMPC were slightly larger than those of C3-ScConv, C2f-Faster, and C2f-ODConv, but its detection speed was 1.55 and 1.40 times those of C3-ScConv and C2f-ODConv, respectively, while its AP50 was 0.9% higher. The results of the ablation and comparison experiments in
Table 3 demonstrate the efficiency of the C3-EMPC module. It provided the greatest improvement in the accuracy of the underwater object detection algorithms while better balancing the numbers of FLOPs and parameters as well as the FPS of the algorithm than the other modules.
To demonstrate the effectiveness of M2F-FPN, we compared it with the PANet and BiFPN architectures in experiments that used the same dataset and hyperparameters; the results are shown in
Table 4. The AP50 of the algorithm using M2F-FPN was 0.3% and 1.2% higher than that achieved with the PANet and BiFPN architectures, respectively. Although there was a slight increase in the algorithm's FLOPs and parameter count and a slight decrease in its detection speed (FPS), we believe this tradeoff was worthwhile in light of the higher accuracy. Our proposed M2F-FPN, which used both fast fusion and concat feature fusion, strengthened the multi-scale feature fusion of the overall algorithm and improved its accuracy.
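The text does not spell out the exact form of the fast fusion, so we assume the BiFPN-style fast normalized fusion (non-negative learnable weights normalized by their sum). A minimal sketch under that assumption, contrasted with concat fusion (names and the per-element representation are ours):

```python
def fast_fusion(features, weights, eps=1e-4):
    """BiFPN-style fast normalized fusion: a weighted sum whose
    non-negative weights are normalized by their total, so the
    output stays on the same scale as the inputs."""
    w = [max(0.0, wi) for wi in weights]  # clamp keeps weights >= 0
    total = sum(w) + eps                  # eps avoids division by zero
    out = [0.0] * len(features[0])
    for feat, wi in zip(features, w):
        for i, v in enumerate(feat):
            out[i] += (wi / total) * v
    return out

def concat_fusion(features):
    """Concatenation-based fusion: stack features along the channel
    axis and let a following convolution mix them."""
    out = []
    for feat in features:
        out.extend(feat)
    return out
```

Fast fusion keeps the channel count fixed (cheap, weighted averaging), while concat fusion grows it and defers mixing to a convolution; using both, as M2F-FPN does, trades some FLOPs for richer multi-scale aggregation.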
We also conducted an ablation experiment to evaluate the contribution of the number of channels of the algorithm to its overall detection-related performance; the results are shown in
Table 5. The EMPC module requires dividing its input channels into four groups, so we set the module's minimum number of input channels to 64 to ensure effective feature extraction. Because a module in the algorithm varies the channel scale, the minimum channel count of the algorithm was set to 128 to ensure its proper operation. It is clear from the results in
Table 5 that increasing the number of channels did not improve the accuracy of the algorithm. With 256 channels, its GFLOPs reached 232.9, and further increasing the channel count was impractical when training on a single GPU. With 128 channels, the algorithm had the smallest FLOPs and parameter counts, recorded the highest detection accuracy (AP50) of 86.9% on the test set, and delivered the best generalization performance.
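The channel constraint described above can be sketched as follows (an illustrative helper of our own; the 16-channel-per-group floor is our inference from the stated 64-channel minimum divided into four groups):

```python
def split_into_groups(channels, num_groups=4, min_per_group=16):
    """Split a channel count into equal groups, mirroring the grouping
    operation described for the EMPC module.

    With four groups and an assumed minimum of 16 channels per group,
    the smallest valid input width is 64, matching the module's
    stated minimum input channel count.
    """
    if channels % num_groups != 0:
        raise ValueError("channel count must be divisible by the group count")
    per_group = channels // num_groups
    if per_group < min_per_group:
        raise ValueError("too few channels per group for effective feature extraction")
    return [per_group] * num_groups
```

For example, the algorithm's 128-channel setting yields four groups of 32 channels each.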
The loss function is an important component of a deep neural algorithm, as it assesses the model's predictions; a well-defined bounding-box loss function can significantly improve model performance. To determine the effectiveness of integrating MPDIoU as the bounding-box loss function of the proposed EF-UODA, we compared its effect on the performance of our method with that of seven state-of-the-art loss functions on the same test set: CIoU, DIoU [
52], GIoU [
53], EIoU [
54], SIoU [
55], AlphaIoU [
56], and Wise-IoU [
57]. The experimental results are shown in
Table 6 and
Figure 10, from which it is clear that the algorithm delivered the best performance in terms of both accuracy and FPS when MPDIoU was used as the bounding-box loss function. It recorded an AP50 of 86.9%, 0.4% higher than that of GIoU (the second-best method), and an FPS of 80.65, 1.15 times that of DIoU (the second-best method). MPDIoU thus improved both the FPS and the accuracy of EF-UODA.
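Following the published MPDIoU formulation (IoU penalized by the normalized squared distances between corresponding box corners), the loss can be sketched as below. This is an illustrative single-box version, not the paper's batched PyTorch implementation:

```python
def mpdiou(box_a, box_b, img_w, img_h):
    """MPDIoU: IoU minus the squared distances between the two boxes'
    top-left and bottom-right corners, normalized by the squared
    image diagonal. Boxes are (x1, y1, x2, y2)."""
    # Plain IoU.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    iou = inter / (area_a + area_b - inter)
    # Squared corner distances, normalized by the image diagonal squared.
    d1 = (box_a[0] - box_b[0]) ** 2 + (box_a[1] - box_b[1]) ** 2
    d2 = (box_a[2] - box_b[2]) ** 2 + (box_a[3] - box_b[3]) ** 2
    diag2 = img_w ** 2 + img_h ** 2
    return iou - d1 / diag2 - d2 / diag2

def mpdiou_loss(pred, target, img_w, img_h):
    """Bounding-box regression loss: 1 - MPDIoU."""
    return 1.0 - mpdiou(pred, target, img_w, img_h)
```

Because the corner-distance penalty is normalized by the image size rather than by the enclosing box, the loss remains well scaled across images whose object scales differ drastically, which is consistent with the benefit observed on URPC.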
4.4. Comparison with Other Algorithms
We compared EF-UODA with SSD and Faster R-CNN on the same test set to verify its detection accuracy. SSD and Faster R-CNN use VGG16 and ResNet50 as their backbone, respectively, and the experimental results are shown in
Table 7. Further, the trend of the speed indicator FPS is given in
Table 7. Additionally, the results of the experiments are depicted in
Figure 11, facilitating the observation of data variations. Here, the mAP represents the average AP over the 10 IoU thresholds in the range [0.5:0.95]. EF-UODA recorded an accuracy at an IoU of 0.5 that was 25.8% and 11.1% higher than those of SSD and Faster R-CNN, respectively. In terms of the mAP metric, its accuracy was 20.4% and 8.0% higher than those of SSD and Faster R-CNN, respectively. This demonstrates that our algorithm is significantly more accurate than the classical single-stage object detection algorithm SSD and the two-stage Faster R-CNN.
We compared EF-UODA with the state-of-the-art ViT-based object detection algorithm RT-DETR [
58] under the same conditions to illustrate the effectiveness of our proposed ViT-based EF-UODA; the results are shown in
Table 7. RT-DETR used ResNet-101 as its backbone. EF-UODA was more accurate than RT-DETR by 0.9% in AP50 and 2.1% in mAP. Its GFLOPs and parameters amounted to only 56.66% and 46.46%, respectively, of those of RT-DETR. Although its FPS was 23.52 lower than that of RT-DETR, EF-UODA still achieved 80.69 FPS, which is sufficient for real-time detection. Considering that designing a more accurate underwater object detection algorithm is the core purpose of this paper, and taking the GFLOPs and parameter counts into account, EF-UODA is the better underwater object detection algorithm.
To demonstrate that our algorithm could accurately detect underwater objects, we compared it with the SOTA one-stage object detection algorithms YOLOv5X, YOLOv7X, and YOLOv8X on the same test set. All four algorithms were based on the PyTorch framework, and the results are shown in
Table 7. EF-UODA used 74.47% and 49% fewer GFLOPs and parameters than YOLOv7X, respectively, with an FPS that was 23.18 higher. Its accuracy in terms of AP50 and mAP was also higher by 5.2% and 6.9%, respectively. Moreover, EF-UODA used 68.69% and 40.25% fewer GFLOPs and parameters than YOLOv5X, and its accuracy in terms of AP50 and mAP was higher by 3.9% and 4.0%, respectively. We think that this increase in accuracy was worthwhile, despite the reduction of 25.73 in its FPS. Its FPS improved by 8.71 compared with the SOTA one-stage object detection algorithm YOLOv8X, while its accuracy in terms of AP50 and mAP was higher by 4.4% and 2.9%, respectively. It also used 54.39% and 50.92% fewer GFLOPs and parameters, respectively, than YOLOv8X.
We concluded that our proposed EF-UODA more accurately detected underwater objects than other SOTA object detection algorithms while reducing the number of FLOPs and parameters and ensuring a high speed of detection.
4.5. Comparison of the Detection Results
We randomly selected several images from the test set for experiments on underwater object detection by using YOLOv5X, YOLOv7X, YOLOv8X, and our proposed EF-UODA algorithm. The results are shown in
Figure 12.
It is clear from
Figure 12 that the dataset selected for the experiments consisted of blurry images extracted from real videos captured underwater, which made the objects in them difficult to detect. This dataset, with its complex backgrounds, thus imposed stringent requirements on the detection and feature extraction capabilities of the algorithms. YOLOv5X, YOLOv7X, and YOLOv8X missed more objects and made more incorrect predictions than the proposed method. In contrast, EF-UODA exhibited better feature extraction capability through the C3-EMPC module, which is based on the idea of the grouping operation, and used the multi-path fast-fusion FPN for multi-scale feature fusion. It was thus able to detect underwater objects more accurately than the prevalent one-stage object detection algorithms YOLOv5X, YOLOv7X, and YOLOv8X.
To verify the real-world effectiveness of the algorithm, we extracted two frames from an underwater video and ran inference with the proposed algorithm; the results are shown in
Figure 13.
Figure 13a contains five sea urchins and three sea stars, and
Figure 13b contains four sea urchins. EF-UODA successfully detected all the objects in both images with high confidence, demonstrating the effectiveness of the algorithm in the real world.