Article

Fish Detection under Occlusion Using Modified You Only Look Once v8 Integrating Real-Time Detection Transformer Features

1 School of Computer Science and Engineering, Sichuan University of Science and Engineering, Zigong 643000, China
2 School of Physics and Electronic Engineering, Sichuan University of Science and Engineering, Zigong 643000, China
3 Third Institute of Oceanography, Ministry of Natural Resources, Xiamen 361000, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2023, 13(23), 12645; https://doi.org/10.3390/app132312645
Submission received: 28 October 2023 / Revised: 18 November 2023 / Accepted: 20 November 2023 / Published: 24 November 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Fish object detection has attracted significant attention because of the considerable role that fish play in human society and ecosystems and the need to gather more comprehensive fish data from underwater videos and images. However, fish detection has always struggled with occlusion caused by dense populations and by underwater plants that obscure the fish, and no perfect solution has been found to date. To address the occlusion issue in fish detection, the following efforts were made: creating a dataset of occluded fish, integrating the innovative modules of the Real-time Detection Transformer (RT-DETR) into You Only Look Once v8 (YOLOv8), and applying Repulsion Loss. The results show that on the occlusion dataset, the mAP of the original YOLOv8 is 0.912, while the mAP of our modified YOLOv8 is 0.971. In addition, our modified YOLOv8 outperforms the original YOLOv8 in terms of loss curves, F1–Confidence curves, P–R curves, the mAP curve, and the actual detection results. All of this indicates that our modified YOLOv8 is suitable for fish detection in occlusion scenes.

1. Introduction

Fish resources, among the most critical living resources, play a crucial role in preserving the planet’s ecological balance, providing food for human beings, and serving as a material foundation for the long-term growth of human society [1,2,3]. As a result, the world has placed increasing emphasis on investigating, researching, and exploiting fish [4,5]. Fish species are diverse and numerous, so monitoring their numbers, distribution, and behavioral habits is crucial. Fish detection in underwater images or videos is therefore vital, as it can provide essential data for research and conservation efforts.
Fish detection using computer vision models presents many challenges, such as illumination variations, low contrast, high noise, fish deformation, and dynamic backgrounds [6]. Over the years, much work has been performed to address these challenges [7,8,9]. However, occlusion has always been a difficult problem in fish detection [6]. In underwater photographs and videos, many fish objects are occluded by other fish or by the environment, such as underwater plants, making their detection challenging. Detecting these occluded fish objects is vital for a more precise understanding of fish abundance, activity, and ecology. Unfortunately, despite years of research, there is still no perfect solution to the occlusion problem in object detection [10,11,12,13], which makes the occlusion problem in fish detection extremely difficult to deal with.
Many state-of-the-art object detection models have emerged, each with its own breakthroughs in speed and accuracy. You Only Look Once v8 (YOLOv8) [14] and the Real-time Detection Transformer (RT-DETR) [15] are the latest state-of-the-art models. This study significantly improves the detection accuracy of YOLOv8 by integrating innovative parts of RT-DETR into YOLOv8. The loss function plays a crucial role in an object detection model, and different loss functions suit different detection tasks and scenarios. Repulsion Loss is a specialized loss function that is particularly useful for occlusion scenes. In this work, Repulsion Loss replaces the original loss function of YOLOv8 to further improve its performance in fish detection under occlusion. Overall, these models and methods make the following main contributions:
First, the basic YOLOv8 model is responsible for the initial extraction of fish features from the images. Second, the integrated RT-DETR modules encode and decode complex multi-scale fish features so that the model learns as many complex fish features as possible to cope with occlusion scenarios. Finally, Repulsion Loss helps the model find the direction of gradient descent more accurately on the occlusion dataset, which makes the trained model more robust in occlusion scenarios.

2. Materials and Methods

2.1. Overview of YOLOv8

YOLOv8 [14] is a new state-of-the-art object detection model offering cutting-edge performance in terms of accuracy and speed. YOLOv8 introduces new features and optimizations that build upon the advancements of previous YOLO versions, making it an ideal choice for various object detection tasks in a wide range of applications [16]. The YOLOv8 architecture builds upon the YOLO algorithm’s earlier versions, and is shown in Figure 1.
The backbone of YOLOv8 is a modified version of the CSPDarknet53 architecture, composed of 53 convolutional layers, and uses cross-stage partial connections to facilitate information flow between the different layers. This backbone is responsible for extracting features from the input images, while the network’s head is where the actual object detection takes place, using the extracted features to predict the class probabilities and bounding box coordinates for each grid cell in the input image. In addition, the network’s head includes a self-attention mechanism, allowing the model to focus on the most relevant features for object detection.
Unlike previous YOLO versions, YOLOv8 is an anchor-free model that directly predicts an object’s center instead of the offset from a known anchor box. This mechanism is a significant improvement over previous YOLO versions, as it eliminates the need for anchor boxes, which can struggle with objects of unusual shapes or sizes. YOLOv8 employs adaptive training to optimize the learning rate and balance the loss function, leading to better model performance. The model also uses advanced data augmentation techniques, such as MixUp [18] and CutMix [19], to enhance robustness and generalization. Finally, YOLOv8's architecture is highly customizable, allowing users to easily modify the model’s structure and parameters to suit their needs, making it a flexible tool for a wide range of object detection tasks.
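As an illustration of how YOLOv8 is typically driven in practice, the following minimal sketch runs a pretrained model on a single image through the ultralytics Python API; the weight file name and image path are placeholders, not artifacts of this paper.

```python
# Minimal sketch: run a pretrained YOLOv8 model with the ultralytics Python API.
# "yolov8m.pt" and "fish.jpg" are placeholders for a weight file and a test image.
from ultralytics import YOLO

model = YOLO("yolov8m.pt")                       # load a medium-sized pretrained model
results = model.predict("fish.jpg", conf=0.25)   # inference with a confidence threshold

for r in results:
    for box in r.boxes:
        # each box exposes the predicted class id, confidence, and xyxy coordinates
        print(int(box.cls), float(box.conf), box.xyxy.tolist())
```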

2.2. Integrating RT-DETR Features into YOLOv8

RT-DETR [15] is an innovative real-time end-to-end object detection system that has achieved state-of-the-art results in object detection. Compared with earlier YOLO object detectors, RT-DETR significantly improves speed and accuracy by leveraging a novel hybrid encoder and IoU-aware query selection. The system consists of a transformer decoder with auxiliary prediction heads, a ResNet [20]/HGNetv2 backbone, and a hybrid encoder. The community can access pre-trained models and source code to facilitate further research and adoption.

2.2.1. Overview of RT-DETR

RT-DETR is a modern, cutting-edge, end-to-end detector that offers real-time performance while maintaining excellent accuracy. It effectively decouples intra-scale interaction and cross-scale fusion to process multi-scale features using the power of Vision Transformers. RT-DETR is highly adaptable, supporting the flexible adjustment of inference speed through different numbers of decoder layers without retraining. This study integrates the innovative parts of RT-DETR into YOLOv8 to improve performance in occlusion scenarios. The architecture of RT-DETR is shown in Figure 2.

2.2.2. Integrating the Efficient Hybrid Encoder

The efficient hybrid encoder of RT-DETR transforms multi-scale features into a sequence of image features through attention-based intra-scale feature interaction (AIFI) and a cross-scale feature-fusion module (CCFM). This Vision Transformer-based design reduces computational costs and enables real-time object detection. Because the efficient hybrid encoder processes multi-scale features by decoupling intra-scale interaction and cross-scale fusion, it helps the model learn more sophisticated aspects of occluded objects. To use the efficient hybrid encoder and keep the model lightweight, some YOLOv8 convolutional layers are removed and the efficient hybrid encoder is added to the backbone.
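The sketch below illustrates the AIFI idea in a simplified form: self-attention is applied only within the flattened top-level feature map. It is a minimal PyTorch sketch under assumed layer sizes, omits positional encoding, and is not the exact RT-DETR implementation.

```python
# Simplified sketch of attention-based intra-scale feature interaction (AIFI):
# self-attention runs only within the highest-level (lowest-resolution) feature map,
# which is flattened into a token sequence. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class AIFI(nn.Module):
    def __init__(self, channels=256, num_heads=8, ffn_dim=1024):
        super().__init__()
        self.encoder_layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, dim_feedforward=ffn_dim,
            batch_first=True)

    def forward(self, s5):
        # s5: (batch, channels, H, W), the top-level backbone feature map
        b, c, h, w = s5.shape
        tokens = s5.flatten(2).permute(0, 2, 1)   # (batch, H*W, channels)
        tokens = self.encoder_layer(tokens)       # intra-scale self-attention
        return tokens.permute(0, 2, 1).reshape(b, c, h, w)

# example: a 20x20 top-level feature map with 256 channels keeps its shape
out = AIFI()(torch.randn(1, 256, 20, 20))
```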

2.2.3. Integrating IoU-aware Query Selection

Recent advancements in computer vision have led to the development of RT-DETR, a model that improves object query initialization by utilizing IoU-aware query selection. IoU-aware query selection picks a fixed number of image features to serve as the initial object queries for the decoder, allowing the model to focus on the most relevant objects in the scene and ultimately enhancing detection accuracy. The IoU-aware query selection is added between the backbone and the YOLOv8 head to improve the accuracy of YOLOv8 in occlusion scenes.
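As a rough, hypothetical illustration of the query-selection mechanism, the sketch below scores every encoder output token with a classification head and keeps the top-k tokens as initial object queries; the IoU-aware part of RT-DETR mainly changes how this score is supervised during training and is not shown here.

```python
# Hedged sketch of query selection: keep the top-k encoder tokens by class score
# as initial object queries for the decoder. Names and sizes are illustrative.
import torch
import torch.nn as nn

def select_queries(memory: torch.Tensor, cls_head: nn.Linear, k: int = 300):
    # memory: (batch, num_tokens, dim), the flattened encoder output features
    scores = cls_head(memory).max(dim=-1).values   # best class score per token
    topk = scores.topk(k, dim=1).indices           # indices of the k best tokens
    batch_idx = torch.arange(memory.size(0)).unsqueeze(-1)
    return memory[batch_idx, topk]                 # (batch, k, dim) initial queries

memory = torch.randn(2, 8400, 256)
cls_head = nn.Linear(256, 3)                       # 3 fish classes in this dataset
queries = select_queries(memory, cls_head)         # shape (2, 300, 256)
```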

2.2.4. Integrating RT-DETR’s Decoder

The modified YOLOv8 architecture includes the addition of RT-DETR’s decoder with auxiliary prediction heads. This decoder optimizes object queries to generate boxes and confidence scores based on the features output by the hybrid encoder. Since the efficient hybrid encoder is used to process the image features, the decoder is needed to turn them into predictions. Overall, the modified YOLOv8 architecture provides improved object detection capabilities. Figure 3 depicts the final architecture of our modified YOLOv8.
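The following simplified sketch shows what a decoder with auxiliary prediction heads can look like: each decoder layer refines the object queries against the encoder output and carries its own class and box heads, so intermediate layers also receive supervision during training. Layer counts and head shapes are illustrative assumptions, not the exact RT-DETR configuration.

```python
# Sketch of a decoder with auxiliary prediction heads (illustrative sizes only).
import torch
import torch.nn as nn

class DecoderWithAuxHeads(nn.Module):
    def __init__(self, dim=256, num_layers=6, num_classes=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
            for _ in range(num_layers))
        self.cls_heads = nn.ModuleList(nn.Linear(dim, num_classes) for _ in range(num_layers))
        self.box_heads = nn.ModuleList(nn.Linear(dim, 4) for _ in range(num_layers))

    def forward(self, queries, memory):
        outputs = []
        for layer, cls_head, box_head in zip(self.layers, self.cls_heads, self.box_heads):
            queries = layer(queries, memory)   # cross-attend to encoder features
            # every layer predicts classes and normalized boxes (auxiliary supervision)
            outputs.append((cls_head(queries), box_head(queries).sigmoid()))
        return outputs                          # the last entry is the final prediction

preds = DecoderWithAuxHeads()(torch.randn(2, 300, 256), torch.randn(2, 8400, 256))
```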

2.3. Modifying the Loss Function

Repulsion Loss [21] was designed to enhance pedestrian detection performance in crowded scenes. Its key idea is that repulsion by surrounding objects is highly helpful: attraction by the target alone may not be sufficient for training an optimal detector under occlusion. Repulsion Loss is made up of three components, defined as follows:
$$ \mathrm{Repulsion\ Loss} = L_{\mathrm{Attr}} + \alpha L_{\mathrm{RepGT}} + \beta L_{\mathrm{RepBox}} \tag{1} $$
where $L_{\mathrm{Attr}}$ is the attraction term, which requires a predicted box to approach its designated target, while $L_{\mathrm{RepGT}}$ and $L_{\mathrm{RepBox}}$ are the repulsion terms, which require a predicted box to keep away from other surrounding ground-truth objects and from other predicted boxes with different designated targets, respectively. The coefficients $\alpha$ and $\beta$ act as weights that balance the auxiliary losses.
Attraction Term. Attraction loss, which narrows the gap between predicted boxes and ground-truth boxes as measured by some distance metric, e.g., Euclidean distance [22], $\mathrm{Smooth}_{L1}$ distance [23], or IoU [24], is commonly adopted in existing bounding box regression techniques. For a fair comparison, the $\mathrm{Smooth}_{L1}$ distance is adopted for the attraction term, as in [25,26], with the smooth parameter set to 2. Given a proposal $P \in \mathcal{P}_+$, the ground-truth box with the maximum IoU is assigned as its designated target: $G_{\mathrm{Attr}}^{P} = \operatorname*{arg\,max}_{G \in \mathcal{G}} \mathrm{IoU}(G, P)$, and $B^{P}$ is the predicted box regressed from proposal $P$. The attraction loss is then calculated as:
$$ L_{\mathrm{Attr}} = \frac{\sum_{P \in \mathcal{P}_+} \mathrm{Smooth}_{L1}\left(B^{P}, G_{\mathrm{Attr}}^{P}\right)}{\left|\mathcal{P}_+\right|} \tag{2} $$
Repulsion Term (RepGT). RepGT Loss is designed to repel a proposal from neighboring ground-truth objects that are not its target. Given a proposal $P \in \mathcal{P}_+$, its repulsion ground-truth object is defined as the ground-truth object with which it has the largest IoU region, except its designated target:
$$ G_{\mathrm{Rep}}^{P} = \operatorname*{arg\,max}_{G \in \mathcal{G} \setminus \{G_{\mathrm{Attr}}^{P}\}} \mathrm{IoU}(G, P) \tag{3} $$
Inspired by IoU Loss in [24], RepGT Loss is calculated to penalize the overlap between $B^{P}$ and $G_{\mathrm{Rep}}^{P}$, which is defined by Intersection over Ground-truth (IoG): $\mathrm{IoG}(B, G) \triangleq \frac{\mathrm{area}(B \cap G)}{\mathrm{area}(G)}$. As $\mathrm{IoG}(B, G) \in [0, 1]$, RepGT Loss is defined as:
$$ L_{\mathrm{RepGT}} = \frac{\sum_{P \in \mathcal{P}_+} \mathrm{Smooth}_{\ln}\left(\mathrm{IoG}\left(B^{P}, G_{\mathrm{Rep}}^{P}\right)\right)}{\left|\mathcal{P}_+\right|} \tag{4} $$
where
$$ \mathrm{Smooth}_{\ln}(x) = \begin{cases} -\ln(1 - x) & x \le \sigma \\ \dfrac{x - \sigma}{1 - \sigma} - \ln(1 - \sigma) & x > \sigma \end{cases} \tag{5} $$
is a smoothed ln function that is continuously differentiable in (0, 1), and $\sigma \in [0, 1)$ is the smooth parameter that adjusts the sensitivity of Repulsion Loss to outliers. From Equations (4) and (5), the more a proposal tends to overlap with a non-target ground-truth object, the larger the penalty RepGT Loss adds to the bounding box regressor. In this way, RepGT Loss effectively stops a predicted bounding box from shifting toward neighboring objects that are not its target.
Repulsion Term (RepBox). NMS is a necessary postprocessing step in most detection frameworks to merge the primarily predicted bounding boxes that are supposed to bound the same object. However, the detection results are affected significantly by NMS, especially in crowded cases. To make the detector less sensitive to NMS, RepBox Loss is proposed:
$$ L_{\mathrm{RepBox}} = \frac{\sum_{i \ne j} \mathrm{Smooth}_{\ln}\left(\mathrm{IoU}\left(B^{P_i}, B^{P_j}\right)\right)}{\sum_{i \ne j} \mathbb{1}\left[\mathrm{IoU}\left(B^{P_i}, B^{P_j}\right) > 0\right] + \epsilon} \tag{6} $$
where $\mathbb{1}[\cdot]$ is the identity (indicator) function and $\epsilon$ is a small constant that prevents division by zero. According to Equation (6), minimizing RepBox Loss requires the IoU region between two predicted boxes with different designated targets to be small. This means that RepBox Loss reduces the probability that predicted bounding boxes with different regression targets are merged into one after NMS, which makes the detector more robust in crowd scenes. The original bounding box loss function of YOLOv8 is CIoU Loss, and in this paper it is replaced with Repulsion Loss.
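To make the terms above concrete, the following minimal sketch implements $\mathrm{Smooth}_{\ln}$ (Equation (5)), IoG, and the RepGT term (Equation (4)) for axis-aligned boxes; target assignment, the attraction term, and the RepBox denominator are omitted, and the function names are our own, not part of the YOLOv8 code base.

```python
# Condensed sketch of the RepGT ingredients defined above; boxes are (x1, y1, x2, y2).
import torch

def smooth_ln(x, sigma=0.5):
    """Piecewise-smooth -ln(1 - x); linear above the threshold sigma (Eq. 5)."""
    x = x.clamp(max=1 - 1e-6)
    sig = torch.tensor(sigma, dtype=x.dtype)
    return torch.where(x <= sigma,
                       -torch.log(1 - x),
                       (x - sigma) / (1 - sigma) - torch.log(1 - sig))

def iog(pred, gt):
    """Intersection over Ground-truth area for (N, 4) boxes in xyxy format."""
    lt = torch.max(pred[:, :2], gt[:, :2])
    rb = torch.min(pred[:, 2:], gt[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    gt_area = (gt[:, 2:] - gt[:, :2]).prod(dim=1)
    return inter / gt_area.clamp(min=1e-6)

def repgt_loss(pred_boxes, rep_gt_boxes, sigma=0.5):
    """RepGT term (Eq. 4): penalize overlap with the closest non-target ground truth."""
    return smooth_ln(iog(pred_boxes, rep_gt_boxes), sigma).mean()

# toy example: one prediction overlapping a non-target ground truth with IoG = 0.25
pred = torch.tensor([[0.0, 0.0, 10.0, 10.0]])
rep_gt = torch.tensor([[5.0, 5.0, 15.0, 15.0]])
loss = repgt_loss(pred, rep_gt)          # -ln(0.75), roughly 0.288
```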

2.4. Experiments

2.4.1. Fish Dataset under Occlusion

We constructed an occlusion dataset of 1850 fish images containing occlusions to investigate how to deal with the occlusion problem in fish detection. This dataset included three fish species with the class names “Reticulate Dascyllus”, “Sergeant Major Fish”, and “Teira Batfish”. Other fishes or the environment occluded most fish objects. We used an online annotation tool to annotate the fish objects in the images one by one, including the occluded objects, and the number of instances of the three classes of fish was comparable. The annotation format of our dataset is YOLOv8 txt. Some of the images in the dataset are shown in Figure 4.
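For reference, the YOLO txt format stores one object per line as "class_id x_center y_center width height", with coordinates normalized by the image size. The minimal parser below is only an illustration; the file name is a placeholder, and the mapping of class id 0 to a particular species is an assumption.

```python
# Minimal reader for a YOLO txt label file (one object per line,
# "class_id x_center y_center width height", all coordinates in [0, 1]).
def read_yolo_labels(path):
    boxes = []
    with open(path) as f:
        for line in f:
            cls, xc, yc, w, h = line.split()
            boxes.append((int(cls), float(xc), float(yc), float(w), float(h)))
    return boxes

# e.g. a line "0 0.512 0.430 0.180 0.095" describes a box near the image center,
# about 18% of the width and 10% of the height, for whichever species maps to id 0.
```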

2.4.2. Experimental Environment

All experiments in this paper were carried out on a Dell Precision Tower 7920 workstation, which was equipped with an NVIDIA Quadro RTX 6000 GPU, and the software environment was Windows 10, Python 3.11, PyTorch 2.0.1 and CUDA 11.7. The main hardware and software environments are shown in Table 1.

2.4.3. Experimental Setup

YOLOv8 is part of the open-source computer vision project “ultralytics”. The ultralytics package can be installed with pip or conda. Because YOLOv8 needs to be modified in this paper, it is necessary to clone the ultralytics repository from GitHub. After cloning the repository and installing its dependencies, it is convenient to train YOLOv8 on the occlusion dataset through the command-line interface (CLI). YOLOv8 has a total of five models, namely YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x. Their complexity and number of parameters increase in that order, so YOLOv8x generally has the best accuracy. All five models were trained to provide comprehensive experiments with the original YOLOv8.
For our modified YOLOv8, the architecture and loss function of YOLOv8 in the ultralytics repository were modified as described above. However, the main training hyperparameters remain the same for both the initial YOLOv8 and our modified version. Preliminary training runs on the occlusion dataset with the original YOLOv8 showed that the model achieves the highest mAP with the SGD optimizer and a learning rate of 0.01. The batch size and number of workers are kept at their defaults, and the number of epochs is set high because of the large dataset. Table 2 shows the main training hyperparameters.
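As an illustration only, a run with the hyperparameters of Table 2 could also be launched through the ultralytics Python API rather than the CLI; the model configuration file and dataset YAML names below are placeholders, not the exact files used in this work.

```python
# Sketch of training with the Table 2 hyperparameters via the ultralytics Python API.
# "yolov8m.yaml" and "fish_occlusion.yaml" are placeholder configuration names.
from ultralytics import YOLO

model = YOLO("yolov8m.yaml")        # build the (possibly modified) YOLOv8m architecture
model.train(
    data="fish_occlusion.yaml",     # dataset config: image paths and the 3 class names
    epochs=700,
    batch=16,
    workers=8,
    optimizer="SGD",
    lr0=0.01,                       # initial learning rate
)
```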

3. Results

Table 3 displays the experimental results for the five models of the initial YOLOv8. Because YOLOv8m achieved the best mAP on the occlusion dataset, we chose the initial YOLOv8m as the model to compare with our modified YOLOv8. Hereafter, the initial YOLOv8m is referred to as “the initial YOLOv8”.

3.1. Loss Curves

Figure 5 and Figure 6 show the loss curves for the initial YOLOv8 and our modified YOLOv8, respectively. The initial YOLOv8 model’s three loss curves, box_loss, cls_loss, and dfl_loss, decreased steadily during training and reached their lowest values. However, the three loss curves fluctuated on the validation set. Even worse, there is an extremely high outlier in val/cls_loss, and val/dfl_loss even trends upwards. These curves indicate that the original YOLOv8 model may have overfitted the occlusion dataset: because the dataset contains a large number of occlusions, the features learned by the original YOLOv8 for these occluded objects are not comprehensive, so the model is prone to overfitting and has poor inference and generalization capabilities.
For our modified YOLOv8, the loss curves of the training set and the validation set coincide with each other and both show a stable downward trend, indicating that our modified YOLOv8 learns the features in the data more efficiently, generalizes better, and can adapt to new data.

3.2. F1–Confidence Curves

The F1 score is a commonly used metric for evaluating object detection models [27]. It considers both the precision and recall of the model to compute a score between 0 and 1, with 1 being perfect precision and recall. The F1 score is calculated as the harmonic mean of precision and recall:
$$ F1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{7} $$
where:
  • Precision is the percentage of predicted bounding boxes that accurately match ground truth objects. It measures how many of the detections are actually correct.
  • Recall is the percentage of actual objects that are correctly detected. It measures how many of the ground truth objects are found.
The F1 score is a useful metric because both precision and recall are important in object detection—the model is expected to detect as many real objects as possible without too many incorrectly detected boxes. The F1 score balances these two metrics into a single value. A model with a high F1 score performs well in both avoiding false positives and not missing detections.
So, in summary, the F1 score provides a measure of the accuracy of an object detection model by combining precision and recall. It provides a balanced evaluation of the detector’s performance. Maximizing the F1 score leads to models that are reliable in detecting objects correctly without many mistakes.
The F1–Confidence curve is a metric used to evaluate the performance of object detection models [28]. It shows the relationship between the confidence threshold and the F1 score. A high F1 score at a high confidence indicates a good detector. The area under the F1–Confidence curve summarizes the overall detection performance.
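Conceptually, an F1–Confidence curve can be produced by sweeping the confidence threshold and recomputing precision, recall, and F1 at each threshold, as in the following illustrative sketch; the matching of detections to ground truth is assumed to be done elsewhere, and the function name is our own.

```python
# Illustrative sketch of building an F1-Confidence curve for one class.
import numpy as np

def f1_confidence_curve(conf_scores, is_true_positive, num_gt, thresholds=None):
    # conf_scores: confidence of every detection; is_true_positive: bool per detection
    thresholds = np.linspace(0, 1, 101) if thresholds is None else thresholds
    f1 = []
    for t in thresholds:
        keep = conf_scores >= t                     # detections kept at this threshold
        tp = is_true_positive[keep].sum()
        fp = keep.sum() - tp
        precision = tp / max(tp + fp, 1e-9)
        recall = tp / max(num_gt, 1e-9)
        f1.append(2 * precision * recall / max(precision + recall, 1e-9))
    return thresholds, np.array(f1)
```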
Figure 7a,b show the F1–Confidence curves for the initial YOLOv8 and our modified YOLOv8, respectively. From them, our modified YOLOv8 has produced better outcomes than the original YOLOv8 in the following areas:
(1).
Higher peak F1 score. The maximum F1 score indicates the model’s best performance. The initial YOLOv8's maximum F1 score is 0.86, whereas ours is 0.95, indicating an improvement of 0.09.
(2).
Wider curve and flatter peak. A wider curve and a flatter peak indicate that the F1 score is stable across different thresholds, revealing a more robust model, whereas a narrow peak means high sensitivity to the threshold. Our model’s F1–Confidence curve is wider and has a flatter peak than that of the initial YOLOv8.
(3).
Larger area under the curve. The area under the F1–Confidence curve summarizes performance across all thresholds. A larger area indicates a better model.

3.3. P–R Curves

A Precision–Recall (P–R) curve comprehensively evaluates an object detector and provides insights into precision–recall tradeoffs for different use cases. It is commonly used along with metrics like mean average precision to compare object detectors [29].
Here are some key properties of a P–R curve:
  • The higher the curve, the better the detector’s accuracy at different recall levels.
  • The closer the curve is to the top right corner, the better the detector’s overall precision and recall.
  • The area under the curve (AUC) summarizes the detector’s expected precision over all recall levels. A higher AUC indicates a better detection performance.
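As a concrete illustration (not the ultralytics implementation), a per-class P–R curve can be built by sorting detections by confidence and accumulating true and false positives down the ranking; the function name and toy data below are our own.

```python
# Illustrative construction of a P-R curve for one class from ranked detections.
import numpy as np

def precision_recall_curve(confidences, is_true_positive, num_gt):
    order = np.argsort(-confidences)              # highest confidence first
    tp = np.cumsum(is_true_positive[order])       # cumulative true positives
    fp = np.cumsum(~is_true_positive[order])      # cumulative false positives
    precision = tp / (tp + fp)
    recall = tp / max(num_gt, 1)
    return recall, precision

# toy example: 4 detections, 3 of them correct, 3 ground-truth objects
rec, prec = precision_recall_curve(
    np.array([0.9, 0.8, 0.6, 0.4]),
    np.array([True, True, False, True]),
    num_gt=3)
```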
Figure 8a,b show the P–R curves for the original YOLOv8 and our modified YOLOv8, respectively. Our modified YOLOv8 P–R curve is higher and closer to the top right corner than that of the initial YOLOv8. Moreover, our model has a higher AUC than the initial YOLOv8, indicating better performance of our model.
In a P–R curve, the AUC corresponds to the mean Average Precision (mAP). The mAP of the original YOLOv8 is 0.912, while ours is 0.971, an improvement of 0.059.

3.4. mAP Curves

As a key performance metric for object detection models, mAP (mean Average Precision) measures the mean of the maximum precisions at different recall values over all classes [30]. It balances both precision and recall to provide a single numeric performance indicator.
More formally, average precision (AP) for a single class is calculated as the area under the precision–recall curve. AP sums the precision values at all recall levels using numerical integration:
$$ AP = \int_{0}^{1} p(r)\, dr \tag{8} $$
mAP takes the mean of AP values across all classes. So, it summarizes model performance across all classes into one number.
$$ mAP = \frac{1}{n_c} \sum_{i=1}^{n_c} AP_i \tag{9} $$
where $n_c$ is the number of classes.
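A minimal sketch of Equations (8) and (9) is given below, assuming the precision–recall points for each class are already available (for example from the curve construction shown in Section 3.3); AP is obtained by numerical integration of the precision envelope and mAP by averaging over classes.

```python
# Sketch of AP (area under the P-R curve) and mAP (mean over classes).
import numpy as np

def average_precision(recall, precision):
    # add sentinel points and enforce a monotonically non-increasing precision envelope
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]
    return np.trapz(p, r)                          # numerically integrate p(r) dr

def mean_average_precision(per_class_curves):
    # per_class_curves: list of (recall, precision) arrays, one entry per class
    return np.mean([average_precision(r, p) for r, p in per_class_curves])
```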
Figure 9a,b show the mAP curves for the initial YOLOv8 and our modified YOLOv8, respectively. During the training of the initial YOLOv8, the mAP curve does not show a stable upward trend but rather fluctuates. During the training of our modified YOLOv8, in contrast, the mAP curve shows a steady upward trend and finally stabilizes at its highest value.

3.5. Detection Effects

Figure 10, Figure 11 and Figure 12 show the actual detection effects for the three images in the test set. Panels (a) and (b) represent the predictions of the original YOLOv8 and our modified YOLOv8, respectively. Table 4 counts how many fish objects were predicted by these two models. They clearly show that our modified YOLOv8 can detect more fish objects in occlusion scenes.

4. Discussion

Whether we look at the curves representing model performance, such as the loss curves, F1–Confidence curves, P–R curves, and mAP curves, or at the actual detection results, our modified YOLOv8 performs better than the original YOLOv8. This shows that our modifications to YOLOv8 are effective. The efficient hybrid encoder of RT-DETR enables the model to learn richer multi-scale fish features, and the IoU-aware query selection makes it easier for the model to find the target fishes in the images. Coupled with Repulsion Loss, these modifications allow the model to better find the direction of gradient descent when training on the occlusion dataset. For these reasons, our modified YOLOv8 performs better than the original YOLOv8 on the occlusion dataset.
Compared with the original YOLOv8, our modified YOLOv8 is more suitable for practical engineering applications because of its higher accuracy and performance. However, the model we have trained so far is not ready to be deployed in a large-scale industrial project: its training set contains only three kinds of fish, so the current model can only detect those three kinds, which is insufficient for detecting the many more occluded fish species in the ocean environment. To deploy our model in actual industrial projects, we would need to build a dataset containing more types of occluded fish and use our modified YOLOv8 to train a model that can detect them.

5. Conclusions

This study aims to address the occlusion problem in fish detection. To achieve this objective, we created a dataset of occluded fish containing many occlusions. We also integrated several innovative modules from RT-DETR into YOLOv8, which expanded the YOLOv8 network so that it learns more comprehensive features of fish objects and can better deal with the occlusion problem. Finally, we adopted a loss function specifically designed for occlusion scenes, Repulsion Loss, to better train the model on the occlusion dataset.
We performed many comparative experiments, and the results showed that our modified YOLOv8 had a higher F1 score and mAP than the original YOLOv8 for our occlusion dataset. Our modified YOLOv8 outperformed the original YOLOv8 from the perspective of various curves and detection effects. We hope our work will be useful for fish detection in occlusion scenes.

Author Contributions

Conceptualization, Q.W.; data curation, E.L. and H.M.; formal analysis, J.Z. and W.Z.; funding acquisition, J.Z. and Y.W.; investigation, E.L.; methodology, Q.W.; resources, W.Z.; supervision, Y.W.; validation, H.M.; writing—original draft, E.L. and Q.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded partly by the Technological Innovation Project of Laoshan Laboratory (Grant No. LSKJ202202901) and Science and Technology Planning Project of Fujian Province (Grant No. 2022H0035).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Marushka, L.; Batal, M.; Tikhonov, C.; Sadik, T.; Schwartz, H.; Ing, A.; Chan, H.M. Importance of fish for food and nutrition security among First Nations in Canada. Can. J. Public Health 2021, 112, 64–80. [Google Scholar] [CrossRef]
  2. Hiddink, J.G.; MacKenzie, B.R.; Rijnsdorp, A.; Dulvy, N.K.; Nielsen, E.E.; Bekkevold, D.; Ojaveer, H. Importance of fish biodiversity for the management of fisheries and ecosystems. Fish. Res. 2008, 90, 6–8. [Google Scholar] [CrossRef]
  3. Ditria, E.M.; Lopez-Marcano, S.; Sievers, M.; Jinks, E.L.; Brown, C.J.; Connolly, R.M. Automating the analysis of fish abundance using object detection: Optimizing animal ecology with deep learning. Front. Mar. Sci. 2020, 7, 429. [Google Scholar] [CrossRef]
  4. Raza, K.; Song, H. Fast and accurate fish detection design with improved YOLO-v3 model and transfer learning. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 1–16. [Google Scholar] [CrossRef]
  5. Al Muksit, A.; Hasan, F.; Emon, M.F.H.B.; Haque, M.R.; Anwary, A.R.; Shatabda, S. YOLO-Fish: A robust fish detection model to detect fish in realistic underwater environment. Ecol. Inform. 2022, 72, 101847. [Google Scholar] [CrossRef]
  6. Yang, L.; Liu, Y.; Yu, H.; Fang, X.; Song, L.; Li, D.; Chen, Y. Computer vision models in intelligent aquaculture with emphasis on fish detection and behavior analysis: A review. Arch. Comput. Methods Eng. 2021, 28, 2785–2816. [Google Scholar] [CrossRef]
  7. Salman, A.; Siddiqui, S.A.; Shafait, F.; Mian, A.; Shortis, M.R.; Khurshid, K.; Schwanecke, U. Automatic fish detection in underwater videos by a deep neural network-based hybrid motion learning system. ICES J. Mar. Sci. 2020, 77, 1295–1307. [Google Scholar] [CrossRef]
  8. Chen, L.; Zang, Z.; Huang, T.; Li, Z. Marine fish object detection based on YOLOv5 and attention mechanism. In Proceedings of the 2022 IEEE Smartworld, Ubiquitous Intelligence & Computing, Scalable Computing & Communications, Digital Twin, Privacy Computing, Metaverse, Autonomous & Trusted Vehicles (SmartWorld/UIC/ScalCom/DigitalTwin/PriComp/Meta), Haikou, China, 15–18 December 2022; pp. 1252–1258. [Google Scholar]
  9. Knausgård, K.M.; Wiklund, A.; Sørdalen, T.K.; Halvorsen, K.T.; Kleiven, A.R.; Jiao, L.; Goodwin, M. Temperate fish detection and classification: A deep learning based approach. Appl. Intell. 2022, 52, 6988–7001. [Google Scholar] [CrossRef]
  10. Chandel, H.; Vatta, S. Occlusion Detection and Handling: A Review. Int. J. Comput. Appl. 2015, 120, 33–38. [Google Scholar] [CrossRef]
  11. Gilroy, S.; Jones, E.; Glavin, M. Overcoming occlusion in the automotive environment—A review. IEEE Trans. Intell. Transp. Syst. 2019, 22, 23–35. [Google Scholar] [CrossRef]
  12. Wang, A.; Sun, Y.; Kortylewski, A.; Yuille, A.L. Robust object detection under occlusion with context-aware compositionalnets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA, 13–19 June 2020; pp. 12645–12654. [Google Scholar]
  13. Saleh, K.; Szenasi, S.; Vamossy, Z. Occlusion handling in generic object detection: A review. In Proceedings of the 2021 IEEE 19th World Symposium on Applied Machine Intelligence and Informatics (SAMI), Herl’any, Slovakia, 21–23 January 2021. [Google Scholar]
  14. Available online: https://yolov8.com (accessed on 7 September 2023).
  15. Lv, W.; Xu, S.; Zhao, Y.; Wang, G.; Wei, J.; Cui, C.; Liu, Y. DETRs Beat YOLOs on Real-time Object Detection. arXiv 2023, arXiv:2304.08069. [Google Scholar]
  16. Terven, J.; Cordova-Esparza, D. A comprehensive review of YOLO: From YOLOv1 to YOLOv8 and beyond. arXiv 2023, arXiv:2304.00501. [Google Scholar]
  17. Available online: https://github.com/ultralytics/ultralytics/issues/189 (accessed on 7 September 2023).
  18. Zhang, H.; Cisse, M.; Dauphin, Y.; Lopez-Paz, D. Mixup: Beyond Empirical Risk Minimization. arXiv 2017, arXiv:1710.09412. [Google Scholar]
  19. Yun, S.; Han, D.; Chun, S.; Oh, S.J.; Yoo, Y.; Choe, J. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  20. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  21. Wang, X.; Xiao, T.; Jiang, Y.; Shao, S.; Sun, J.; Shen, C. Repulsion loss: Detecting pedestrians in a crowd. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  22. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  23. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
  24. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. Unitbox: An advanced object detection network. In Proceedings of the 2016 ACM on Multimedia Conference, Amsterdam, The Netherlands, 15–19 October 2016; pp. 516–520. [Google Scholar]
  25. Mao, J.; Xiao, T.; Jiang, Y.; Cao, Z. What can help pedestrian detection? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  26. Zhang, S.; Benenson, R.; Schiele, B. Citypersons: A diverse dataset for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  27. Zhao, L.; Li, S. Object detection algorithm based on improved YOLOv3. Electronics 2020, 9, 537. [Google Scholar] [CrossRef]
  28. Rohit, C.; Sant, R.; Nawade, S.; Vedhant, V.; Gupta, V. Exploring edge artificial intelligence: A comparative study of computing devices for deployment of object detection algorithm. In Proceedings of the IEEE 2023 4th International Conference for Emerging Technology (INCET), Belgaum, India, 26–28 May 2023; pp. 1–5. [Google Scholar]
  29. Padilla, R.; Netto, S.L.; Da Silva, E.A. A survey on performance metrics for object-detection algorithms. In Proceedings of the IEEE 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), Niterói, Brazil, 3–5 June 2020; pp. 237–242. [Google Scholar]
  30. Padilla, R.; Passos, W.L.; Dias, T.L.B.; Netto, S.L.; da Silva, E.A.B. A Comparative Analysis of Object Detection Metrics with a Companion Open-Source Toolkit. Electronics 2021, 10, 279. [Google Scholar] [CrossRef]
Figure 1. The architecture of YOLOv8 [17].
Figure 2. The architecture of RT-DETR.
Figure 3. The architecture of our modified YOLOv8.
Figure 4. The fish dataset under occlusion.
Figure 5. The loss curves of the initial YOLOv8.
Figure 6. The loss curves of our modified YOLOv8.
Figure 7. The F1–Confidence curves for the original YOLOv8 (a) and our modified YOLOv8 (b).
Figure 8. The P–R curves for the original YOLOv8 (a) and our modified YOLOv8 (b).
Figure 9. The mAP curves for the original YOLOv8 (a) and our modified YOLOv8 (b). The blue line shows the actual mAP values and how they change with the epoch; the yellow dots are obtained by smoothing the blue line and reflect the general trend of the mAP curve.
Figure 10. Prediction 1. (a) Twelve bounding boxes. (b) Fourteen bounding boxes.
Figure 11. Prediction 2. (a) Eleven bounding boxes. (b) Sixteen bounding boxes.
Figure 12. Prediction 3. (a) Thirteen bounding boxes. (b) Seventeen bounding boxes.
Table 1. Experimental environment.

Machine     Dell Precision Tower 7920 workstation
GPU         NVIDIA Quadro RTX 6000
OS          Windows 10
Python      3.11
PyTorch     2.0.1
CUDA        11.7
Table 2. Main training hyperparameters.

epochs          700
batch size      16
workers         8
optimizer       SGD
learning rate   0.01
Table 3. The results from the initial YOLOv8.

Models     Val mAP50    Epochs Completed
YOLOv8n    0.895        278
YOLOv8s    0.907        147
YOLOv8m    0.912        120
YOLOv8l    0.890        149
YOLOv8x    0.899        266
Table 4. Number of fish predicted by the two models in some images from the test set.

Image      The Original YOLOv8    Our Modified YOLOv8
Image 1    12                     14
Image 2    11                     16
Image 3    13                     17

