Figure 1.
The floating-waste datasets. (a) An image from the FloW-img dataset. (b) An image from the FloatingWaste-I dataset.
Figure 2.
Percentage of large, medium and small objects in the FloW-img dataset, FloatingWaste-I dataset and MS COCO dataset.
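The size breakdown above follows the usual MS COCO convention for object sizes. As a minimal illustration (not taken from the paper), the sketch below counts small, medium and large boxes under the assumption of the standard COCO area thresholds of 32² and 96² pixels; the example box sizes are hypothetical.

```python
# Minimal sketch: counting small/medium/large boxes using the MS COCO
# area thresholds (area < 32^2 is small, area < 96^2 is medium, else large).
from collections import Counter

def size_category(w: float, h: float) -> str:
    """Return the COCO-style size category of a box given width/height in pixels."""
    area = w * h
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

def size_percentages(boxes):
    """boxes: iterable of (w, h) tuples; returns {category: percentage}."""
    counts = Counter(size_category(w, h) for w, h in boxes)
    total = sum(counts.values())
    return {k: 100.0 * v / total for k, v in counts.items()}

# Example with three hypothetical annotations.
print(size_percentages([(20, 20), (50, 80), (200, 150)]))
```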
Figure 3.
YOLOv5 detection results on FloW-img with a heat map; light spots and reflections, whose features resemble those of floating waste, are mistaken for objects.
Figure 4.
The YOLO-Float framework. The network consists of three parts: backbone, neck and head. In the backbone, blocks composed of CONV, CSP and SPP are used to extract features (the composition of CONV, CSP and SPP is shown in the figure). The low-level representation-enhancement module (LREM) then ensures that features of small objects are not lost during downsampling in the backbone, while the attention-fusion module (AFM) fuses the highest-resolution and lowest-resolution feature maps. Finally, one head predicts the result of the region classification, and the other three heads predict the results of classification and bounding-box regression. Since each grid cell on the feature map is preset with three anchors of different aspect ratios, each anchor predicts the horizontal and vertical coordinates of its center point, its width, height, confidence and class, so the number of feature channels for each of the three heads is $3\times(4+1+N_{cls})$, where $N_{cls}$ is the number of classes.
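As an illustration of the channel count described in the caption, the following minimal sketch (an assumption, not the authors' implementation) shows how three anchors per grid cell with (x, y, w, h, confidence, class) predictions give 3 × (4 + 1 + N) output channels; the single class and the 80 × 80 feature-map size are chosen only for the example.

```python
# Minimal sketch: how the head channel count in Figure 4 translates into a
# prediction tensor. With 3 anchors per grid cell and (x, y, w, h, confidence,
# class scores) per anchor, a head outputs 3 * (4 + 1 + num_classes) channels.
import torch

num_anchors = 3
num_classes = 1                      # e.g., a single "waste" class (assumption)
channels = num_anchors * (4 + 1 + num_classes)

# A hypothetical head output for a batch of 1 on an 80x80 feature map.
pred = torch.randn(1, channels, 80, 80)

# Reshape so the last dimension holds (x, y, w, h, conf, class scores) per anchor.
b, _, h, w = pred.shape
pred = pred.view(b, num_anchors, 4 + 1 + num_classes, h, w).permute(0, 1, 3, 4, 2)
print(pred.shape)                    # torch.Size([1, 3, 80, 80, 6])
```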
Figure 5.
Low-level representation-enhancement module. Note that the symbol C represents the concatenation that links the channels of two feature maps of the same size. BN represents batch normalization, which helps prevent overfitting and accelerates convergence. SiLU represents the nonlinear activation function Sigmoid Linear Unit, $\mathrm{SiLU}(x) = x\,\sigma(x)$, where $\sigma$ is the logistic sigmoid.
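For readers unfamiliar with the blocks named in the caption, the sketch below is a generic Conv + BN + SiLU block followed by a channel-wise concatenation. It is a minimal illustration of these building blocks, not the actual LREM topology, which is shown in the figure; all layer sizes are assumptions.

```python
# Minimal sketch of the building blocks named in the caption:
# convolution + batch normalization + SiLU, and channel-wise concatenation.
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Convolution followed by batch normalization and the SiLU activation."""
    def __init__(self, c_in: int, c_out: int, k: int = 3, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()         # SiLU(x) = x * sigmoid(x)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# Concatenation ("C" in the caption) links two same-sized maps along channels.
x1 = torch.randn(1, 64, 80, 80)
x2 = torch.randn(1, 64, 80, 80)
fused = torch.cat([x1, x2], dim=1)   # shape: (1, 128, 80, 80)
out = ConvBNSiLU(128, 64)(fused)
print(out.shape)                     # torch.Size([1, 64, 80, 80])
```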
Figure 6.
Example of the attention-fusion module. Note that G represents a feature of one of the two feature maps to be fused, while A, B, C, D, E and F represent features of the other feature map. Broadcasting the feature G over the whole feature map is the process by which the two feature maps are fused.
Figure 7.
Attention-fusion module. We follow the terminology of the three vectors in the transformer. q: query vector. k: vector representing the relevance of the queried information to other information. v: vector representing the queried information. The output of each row is a tensor of shape $HW \times C$ (with $C$ the number of channels), where $HW$ denotes the product of the height $H$ and width $W$ of the feature map. The symbol “×” denotes matrix multiplication and the symbol “C” denotes the concatenation.
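The caption describes transformer-style q/k/v attention over feature maps. The following is a minimal sketch, under the assumption of standard scaled dot-product attention on flattened HW × C maps; the choice of which map supplies the queries, and all tensor sizes, are illustrative assumptions rather than the authors' AFM implementation.

```python
# Minimal sketch of q/k/v attention over flattened feature maps:
# attention weights come from q x k^T, and the weighted v is the fused output.
import torch
import torch.nn.functional as F

B, C, H, W = 1, 64, 20, 20
low_res = torch.randn(B, C, H, W)    # e.g., lowest-resolution map (queries)
high_res = torch.randn(B, C, H, W)   # e.g., highest-resolution map after resizing

q = low_res.flatten(2).transpose(1, 2)    # (B, HW, C)
k = high_res.flatten(2).transpose(1, 2)   # (B, HW, C)
v = high_res.flatten(2).transpose(1, 2)   # (B, HW, C)

attn = F.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)  # (B, HW, HW)
fused = (attn @ v).transpose(1, 2).reshape(B, C, H, W)      # back to a feature map
print(fused.shape)                   # torch.Size([1, 64, 20, 20])
```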
Figure 8.
Acquisition platform and environment. (a) The unmanned surface vehicle and the DJI Pocket2 motion camera. (b) The scenario and time of the collection phase.
Figure 9.
Detection results of different models. (a) Ground truth, (b) YOLOv5, (c) YOLOv5 + LREM, (d) YOLOv5 + AFM, (e) YOLO-Float.
Figure 10.
The loss curves of the model YOLOv5 + LREM in the training phase. (a) LREM-classification loss curve and object-detection loss curve, which can be seen to have a similar decreasing trend. (b) Training loss curve and validation loss curve.
Figure 11.
Scatter plot of LREM-region classification accuracy and object-detection mAP in the training phase. Each point corresponds to the LREM-classification accuracy and the object-detection mAP obtained with the weights from one epoch of the training phase.
Figure 12.
Detection results of different algorithms. (a) The input image, (b) the result of YOLOR, (c) the result of YOLOX, (d) the result of YOLOv7, (e) the result of the proposed method.
Figure 13.
Detection results of YOLO-Float on FloatingWaste-I. (a) shows object occlusion; (b) shows an evening scene.
Table 1.
Comparison of public water-surface object-detection datasets in terms of the number of annotated frames, type of detection object, number of object classes, image resolution, type of environment, USV-based viewpoint and lighting condition. Environment: M—marine, I—inland. USV-based: Y—yes, N—no. Condition: L—light, D—dark.
| Datasets | Year | Frames | Object | Classes | Resolution | Env. | USV-Based | Condition |
|---|---|---|---|---|---|---|---|---|
| MODD [19] | 2016 | 4454 | obstacle | 4 | | M | Y | L |
| SMD [21] | 2017 | 16,000 | boat | 10 | | M | N | L&D |
| MODD2 [20] | 2018 | 11,675 | obstacle | 2 | | M | Y | L |
| FloW-img [11] | 2021 | 2000 | waste | 1 | | I | Y | L |
| FloatingWaste-I | 2023 | 1867 | waste | 2 | | I | Y | L&D |
Table 2.
Ablation experiments on FloW-img.
| Method | AP | AP75 | APs | APm | APl | AR | ARs | ARm | ARl |
|---|---|---|---|---|---|---|---|---|---|
| YOLOv5 | 41.8% | 35.3% | 26.7% | 62.4% | 80.1% | 48.8% | 36.1% | 68.9% | 83.2% |
| YOLOv5 + LREM | 42.3% | 40.3% | 28.2% | 62.4% | 79.3% | 49.6% | 37.1% | 69.6% | 83.2% |
| YOLOv5 + AFM | 43.8% | 40.3% | 29.2% | 63.4% | 82.9% | 51% | 38.9% | 69.9% | 85.9% |
| YOLO-Float | 44.2% | 42% | 30.7% | 62.8% | 82.6% | 51.4% | 39.8% | 69.4% | 85.9% |
Table 3.
Ablation experiments on FloW-img.
| Method | AP | AP50 | AP75 | APs | APl | ARs | ARl |
|---|---|---|---|---|---|---|---|
| YOLOv5 | 41.8% | 82.9% | 35.3% | 26.7% | 80.1% | 36.1% | 83.2% |
| YOLOv5 + Upsample | 43% | 84% | 39.1% | 28.5% | 82.2% | 37.8% | 84.1% |
| YOLOv5 + AFM | 43.8% | 83.5% | 40.3% | 29.2% | 82.9% | 38.9% | 85.9% |
Table 4.
Comparison experiment on FloW-img.
| ALGORITHM | AP | AP50 | AP75 | APs | AR | ARs | Param | FPS |
|---|---|---|---|---|---|---|---|---|
| YOLOv5-X | 41.8% | 82.9% | 35.3% | 26.7% | 48.8% | 36.1% | 83.2 M | 38 |
| YOLOR-D6 | 41% | 77% | 39.1% | 28.6% | 48.4% | 34.9% | 151.7 M | 34 |
| YOLOX-X | 41.5% | 80.5% | 38.4% | 27.6% | 47.6% | 34.2% | 99.1 M | 58 |
| YOLOv7-E6E | 40.8% | 81.8% | 36.9% | 28% | 40% | 39% | 110.3 M | 36 |
| YOLO-Float(-X) | 44.2% | 83.3% | 42% | 30.7% | 51.4% | 39.8% | 91.8 M | 35 |
| Faster R-CNN [11] | 18.4% | | | | | | | 9.3 |
| Cascade R-CNN [11] | 43.4% | | | | | | | 3.9 |
| DSSD [11] | 27.5% | | | | | | | 28.6 |
| RetinaNet [11] | 24.9% | | | | | | | 7.6 |
| FPN [11] | 18.4% | | | | | | | 7.4 |
Table 5.
Comparison experiment on FloatingWaste-I.
| ALGORITHM | AP | AP50 | AP75 | APs | AR | ARs |
|---|---|---|---|---|---|---|
| YOLOv5-X | 34.8% | 82.9% | 22.9% | 20.8% | 41.9% | 31.4% |
| YOLOR-D6 | 26.3% | 51% | 22% | 9.1% | 31.1% | 13.5% |
| YOLOX-X | 35.3% | 81.5% | 26.7% | 21.1% | 39.3% | 26.6% |
| YOLOv7-E6E | 30.4% | 70.5% | 20.6% | 20.2% | 43.4% | 36.2% |
| YOLO-Float(-X) | 37.8% | 82.5% | 30.4% | 23.6% | 44.4% | 32.8% |
Table 6.
Comparison experiment on FloatingWaste-I with different classes.
| ALGORITHM | Bottle AP | Bottle AP50 | Carton AP | Carton AP50 |
|---|---|---|---|---|
| YOLOv5-X | 32.9% | 77.8% | 36.3% | 87.1% |
| YOLOR-D6 | 26.9% | 51.4% | 29.4% | 57.2% |
| YOLOX-X | 32.8% | 75.9% | 37.7% | 86% |
| YOLOv7-E6E | 30.8% | 73.3% | 28.7% | 68.3% |
| YOLO-Float(-X) | 38.6% | 80.7% | 36.8% | 83.3% |