Article

Detection and Tracking of Underwater Fish Using the Fair Multi-Object Tracking Model: A Comparative Analysis of YOLOv5s and DLA-34 Backbone Models

by Sang-Hyun Lee and Myeong-Hoon Oh *
Department of Computer Engineering, Honam University, Gwangsan-gu, Gwangju 62399, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(16), 6888; https://doi.org/10.3390/app14166888
Submission received: 10 July 2024 / Revised: 3 August 2024 / Accepted: 5 August 2024 / Published: 6 August 2024
(This article belongs to the Special Issue Integrating Artificial Intelligence in Renewable Energy Systems)

Abstract
Modern aquaculture utilizes computer vision technology to analyze underwater images of fish, contributing to optimized water quality and improved production efficiency. The purpose of this study is to efficiently perform underwater fish detection and tracking using multi-object tracking (MOT) technology. To achieve this, the FairMOT model was employed to implement pixel-level object detection and re-identification (Re-ID) simultaneously, and two backbone configurations were compared: FairMOT+YOLOv5s and FairMOT+DLA-34. The study constructed a dataset targeting the black porgy, a species popular in Korean aquaculture, using underwater video data from five different environments collected from the internet. During training, both models steadily reduced their training loss, with the FairMOT+DLA-34 model converging faster. In testing, the FairMOT+DLA-34 model achieved a MOTA of 44.1%, an IDF1 of 11.0%, a MOTP of 0.393, and a single ID switch (IDSW), while the FairMOT+YOLOv5s model recorded a MOTA of 43.8%, an IDF1 of 14.6%, a MOTP of 0.400, and 10 ID switches. These results indicate that the FairMOT+YOLOv5s model achieved higher IDF1 and MOTP scores, while the FairMOT+DLA-34 model achieved higher tracking accuracy (MOTA) and far fewer ID switches.

1. Introduction

Fish are important components of marine ecosystems and of human culture and industry; worldwide, 3 billion people consume them in large quantities [1]. Continuous efforts are needed to monitor and manage fish populations and species in order to maintain healthy ecosystems and fish stocks [2].
Modern aquaculture utilizes computer vision technology to analyze underwater images of fish, contributing to optimized water quality and improved production efficiency [3]. According to a 2014 report by the Food and Agriculture Organization (FAO), approximately 110 million tons of seafood are caught worldwide annually, providing about 20% of the animal protein intake for 3 billion people [4]. If overfishing continues alongside global warming and ocean acidification, seafood supplies will become increasingly scarce. In the long term, aquaculture is the most promising way to address seafood shortages and promote sustainable fishing.
According to a report by Statistics Korea [5], the number of marine aquaculture workers decreased from 6236 in 2011 to 5524 in 2019, as shown in Table 1, and most aquaculture workers are over 60 years old, resulting in a severe labor shortage. By introducing Information and Communications Technologies (ICTs) into the marine aquaculture industry, this labor-intensive primary industry can be transformed into an IT-driven one, enhancing competitiveness. Traditional aquaculture processes rely heavily on visual observations by managers, making them highly dependent on subjective experience and requiring significant time and labor to digitize the observed results.
Modern aquaculture mainly relies on sensors to acquire information about fish species, which is vast and complex, making it difficult to fully utilize. However, computer vision technology, based on image processing, pattern recognition, and artificial intelligence, is gradually advancing and can be widely applied in areas such as fisheries resource research, aquaculture, processing, and rare species protection through non-contact observation techniques [6]. Using computer vision technology to study underwater images of fish can enhance operational efficiency, refine data for aquaculture, optimize water quality, and achieve high production efficiency. Furthermore, this technology, combined with artificial intelligence, can enable intelligent task processing.
Previous research has explored various approaches to fish detection and tracking. For instance, Chuang et al. proposed a feature learning and object recognition framework for underwater fish images [7]; Sadeghian et al. studied methods for tracking multiple cues with long-term dependencies [8]; Wang et al. investigated real-time multi-object tracking [9]; and Yu et al. explored object detection through deep layer aggregation [10]. In addition, Kandimalla et al. integrated YOLOv3 with Mask R-CNN and YOLOv4 with the Norfair tracking algorithm to track fish in video data. Their integrated system continuously tracks fish migration and is optimized to maintain high performance, especially in low-frame-rate video [11].
These studies have contributed to improving the efficiency of fish detection and tracking using computer vision and artificial intelligence technologies.
The purpose of this study is to efficiently detect and track underwater fish using Multiple Object Tracking (MOT) technology. To achieve this, we constructed a dataset focusing on black porgy, which is popular in Korean aquaculture, using underwater video data collected from the internet across five different environments. The dataset was divided into Train and Test sets for training and evaluation, and each frame was labeled with the target fish object ID using the DarkLabel tool. The FairMOT model was then used to implement pixel-level object detection and re-identification (Re-ID) simultaneously, and two backbone models were compared and analyzed: FairMOT+YOLOv5s and FairMOT+DLA-34.

2. Material and Methods

In this study, we utilized the FairMOT model to efficiently detect and track underwater fish using Multiple Object Tracking (MOT) [9] technology. The research focused on three main components: Deep Layer Aggregation-34 (DLA-34), YOLOv5, and Fair Multi-Object Tracking (FairMOT). The roles and features of each component are detailed in the sections below.

2.1. Deep Layer Aggregation-34 (DLA-34)

Deep Layer Aggregation (DLA) is a fusion deep network used as a backbone, with the advantage of effectively aggregating information from various layers through deeper fusion. DLA-34 utilizes this structure to provide high accuracy and fewer parameters, and it was chosen as the backbone network to achieve multi-scale detection [11]. The difference between DLA-34 and the original DLA is the addition of more connections between the lower and upper layers, enabling more sophisticated feature extraction. DLA-34 is used as the backbone network for feature extraction, fusing features of different depths to generate high-resolution feature maps. These feature maps play a crucial role in accurately detecting the location and size of objects.
As shown in Figure 1, DLA-34 uses DCNv2 (Deformable Convolutional Networks v2) for up-sampling during the feature extraction process. DCNv2 dynamically adjusts the receptive field according to the scale and pose of the target, helping to alleviate the anchor box alignment problem [11].
The DLA-34 backbone is designed to handle objects of various sizes and extract more sophisticated features through multi-layer aggregation.
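To illustrate the idea, the following is a minimal PyTorch sketch of a deformable convolution in an up-sampling block, using torchvision.ops.DeformConv2d as a stand-in for DCNv2 (which additionally predicts a modulation mask); the channel sizes and block structure are illustrative assumptions, not the exact DLA-34 configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformUpBlock(nn.Module):
    """Illustrative up-sampling block: a deformable convolution whose sampling
    offsets are predicted from the input, followed by 2x bilinear up-sampling."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # The offset branch predicts an (x, y) displacement for each of the
        # k*k kernel positions; DCNv2 would also predict a modulation mask.
        self.offset = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.dcn = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, x):
        offsets = self.offset(x)   # dynamic receptive-field adjustment
        return self.up(self.dcn(x, offsets))

feat = torch.randn(1, 64, 32, 32)          # dummy backbone feature map
print(DeformUpBlock(64, 32)(feat).shape)   # torch.Size([1, 32, 64, 64])
```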

2.2. YOLOv5s

YOLOv5 is an object detection model designed with both speed and accuracy in mind, available in four variants (YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x). As the model size increases, accuracy improves, but the detection time for a single image also increases. The network structure of YOLOv5s is shown in Figure 2. The backbone consists of Focus and C3 structures, and the neck is composed of the Feature Pyramid Network (FPN) and Path Aggregation Network (PAN) [12,13,14,15].
The backbone of YOLOv5 transforms input images into feature maps, and the neck connects the backbone and head, refining and reconstructing the feature maps. The backbone plays a central role in feature extraction, and its complexity significantly affects the overall algorithm's processing time. The Focus module in the backbone generates low-dimensional feature maps by slicing the image, and the Spatial Pyramid Pooling (SPP) layer allows the network to accept images of any size. After passing through the SPP layer, feature maps of arbitrary size are output as fixed-length vectors used for subsequent classification and detection tasks.
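As an illustration of the slicing step, here is a minimal sketch of a Focus-style module, assuming the standard YOLOv5 space-to-depth slicing; the output channel count is an arbitrary example.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Minimal sketch of the YOLOv5 Focus module: slice the image into four
    pixel-interleaved sub-images, stack them along the channel axis (halving
    the spatial size, quadrupling the channels), then apply a convolution."""
    def __init__(self, in_ch=3, out_ch=32, k=3):
        super().__init__()
        self.conv = nn.Conv2d(4 * in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x):
        # Four phase-shifted samplings of every second pixel, concatenated.
        sliced = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                            x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(sliced)

img = torch.randn(1, 3, 640, 640)
print(Focus()(img).shape)  # torch.Size([1, 32, 320, 320])
```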

2.3. Fair Multi-Object Tracking (FairMOT)

FairMOT is a model that simultaneously implements pixel-level object detection and re-identification (Re-ID), effectively tracking multiple objects in each frame. The structure of FairMOT is shown in Figure 3.
FairMOT uses the DLA-34 backbone to extract features from each frame and predict the object’s location and ID. This model exhibits excellent tracking performance, especially in highly dynamic underwater environments. FairMOT’s structure consists of several key components. First, the Detection Head detects the location of the fish. Second, the Re-ID Embeddings assign a unique ID to each object, enabling continuous tracking. Third, the backbone network plays a crucial role in extracting feature maps throughout the system. Finally, the up-sampling layers increase the resolution to generate refined feature maps.
FairMOT is a multi-object tracking model [16,17,18] with two homogeneous branches that simultaneously predict pixel-level object detection and Re-ID functionality. This model starts by inputting the video sequence into the DLA-34 network. The DLA-34 network fuses features of different depths and transmits the extracted multi-layer information to the detection model and the Re-ID model.
The high-resolution feature maps extracted from the DLA-34 network feed several output components. First, heatmaps encode the estimated positions of object centers and the confidence at those positions. Second, the bbox size branch predicts the size of each object. Third, center offsets compensate for the quantization error introduced when coordinates are mapped onto the 1/n-resolution feature map. Finally, Re-ID embeddings assign a unique ID to each object, enabling continuous tracking.
Heatmaps, center offsets, and bbox size are used to detect objects and transmit them to the detection model, while Re-ID Embedding transmits the target’s Re-ID information to the Re-ID model.
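To make the two-branch design concrete, the following is a minimal sketch of the four parallel output heads attached to the backbone feature map; the head widths and the 128-dimensional Re-ID embedding are illustrative assumptions rather than the exact FairMOT configuration.

```python
import torch
import torch.nn as nn

def head(in_ch, out_ch):
    # Each head: 3x3 conv -> ReLU -> 1x1 conv, as in CenterNet-style models.
    return nn.Sequential(nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(256, out_ch, 1))

class FairMOTHeads(nn.Module):
    """Sketch of FairMOT's parallel outputs on one high-resolution feature map."""
    def __init__(self, in_ch=64, emb_dim=128, num_classes=1):
        super().__init__()
        self.heatmap = head(in_ch, num_classes)  # per-pixel object confidence
        self.box_size = head(in_ch, 2)           # predicted box width / height
        self.offset = head(in_ch, 2)             # sub-pixel center correction
        self.reid = head(in_ch, emb_dim)         # per-pixel identity embedding

    def forward(self, fmap):
        return {"hm": torch.sigmoid(self.heatmap(fmap)),
                "wh": self.box_size(fmap),
                "off": self.offset(fmap),
                "id": self.reid(fmap)}

fmap = torch.randn(1, 64, 152, 272)  # e.g., a 1/4-resolution feature map
print({k: tuple(v.shape) for k, v in FairMOTHeads()(fmap).items()})
```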
This study compared and analyzed the performance of underwater fish detection and tracking using two backbone models of the FairMOT model. This research aimed to explore methods to implement efficient and accurate underwater organism detection.

3. Design of FairMOT Mixed Model

The development process of this study can be broadly divided into two stages, each consisting of a series of procedures, aimed at developing a model that can effectively perform multi-object tracking in underwater environments. This development process includes extracting data from underwater fish videos to create a dataset and using the related data to train and evaluate the model.
The first stage is the data collection stage, where videos of fish filmed in underwater environments are collected and stored on a hard disk. The collected video data form the basis for all subsequent processing steps.
The second stage involves processing the data and consists of several detailed procedures. First, the collected video data are preprocessed to convert them into a form suitable for analysis. Each frame of the preprocessed video is then labeled with the position and ID of the fish; this labeling involves identifying and tagging the fish objects in every frame. The labeled data are used to train the YOLOv5s backbone model, improving its object detection capabilities. Based on the trained YOLOv5s model, FairMOT is then trained with two backbone configurations: FairMOT + YOLOv5s and FairMOT + DLA-34. In this process, the model learns to detect the positions of objects and assign IDs. Finally, the trained models are used to perform underwater fish tracking tests, in which test data are used to track the position and ID of the fish. The test results are analyzed, and the tracking performance of each model is evaluated to derive the final results.
In Figure 4, the DeepSORT algorithm process consists of the following five stages. First, the initial frame of the video sequence is taken; this initial frame serves as the starting point for the subsequent process. At this stage, the Kalman filter is used to predict and update the state of the target's movement, i.e., the state vector $x_k$ and the state covariance $P_k$.
In the prediction stage, the current state vector and covariance matrix are used to predict the next state vector and covariance matrix. The state transition matrix $F$ maps the previous state vector $\hat{x}_{k-1}$ to the predicted state vector $\hat{x}_{k|k-1}$. The control input model $B$ and control input $u_{k-1}$ reflect external inputs applied to the system. The covariance matrix $P_{k|k-1}$ represents the uncertainty of the state vector and is propagated through the state transition matrix $F$ and the process noise covariance $Q$.
In the update stage, the predicted state vector and covariance matrix are corrected using the observation $z_k$. The observation model $H$ relates the state to the observation, and the Kalman gain $K_k$ weighs the difference between the predicted and actual observations. Finally, the updated state vector $\hat{x}_k$ and covariance matrix $P_k$ are calculated as shown in Equations (1) and (2).
Through this process, the DeepSORT algorithm can effectively track multiple objects in a video sequence.
The prediction stage is as follows:
$$\hat{x}_{k|k-1} = F\hat{x}_{k-1} + Bu_{k-1}, \qquad P_{k|k-1} = FP_{k-1}F^{T} + Q \tag{1}$$
The update stage is as follows:
$$K_k = P_{k|k-1}H^{T}\left(HP_{k|k-1}H^{T} + R\right)^{-1}, \qquad \hat{x}_k = \hat{x}_{k|k-1} + K_k\left(z_k - H\hat{x}_{k|k-1}\right)$$
$$P_k = \left(I - K_kH\right)P_{k|k-1} \tag{2}$$
Here, $F$ is the state transition matrix, $B$ is the control input model, $Q$ is the process noise covariance, $H$ is the observation model, $R$ is the observation noise covariance, and $K_k$ is the Kalman gain [19]. Next, a deep appearance descriptor extracts a feature vector for each detected object: appearance features are represented by the vector $f$ extracted through a CNN, and motion features are represented by the state vector $x$ obtained through the Kalman filter.
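As a notation check, the following is a direct NumPy transcription of the prediction and update stages in Equations (1) and (2); the concrete matrices below assume a toy one-dimensional constant-velocity model, not the full DeepSORT state.

```python
import numpy as np

def kf_predict(x, P, F, B, u, Q):
    """Prediction stage, Equation (1)."""
    x_pred = F @ x + B @ u        # project the state forward
    P_pred = F @ P @ F.T + Q      # propagate the uncertainty
    return x_pred, P_pred

def kf_update(x_pred, P_pred, z, H, R):
    """Update stage, Equation (2)."""
    S = H @ P_pred @ H.T + R                  # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)       # Kalman gain
    x = x_pred + K @ (z - H @ x_pred)         # correct with the observation
    P = (np.eye(len(x)) - K @ H) @ P_pred
    return x, P

# Toy 1-D constant-velocity model: state = [position, velocity].
F = np.array([[1., 1.], [0., 1.]]); H = np.array([[1., 0.]])
x, P = np.zeros(2), np.eye(2)
x, P = kf_predict(x, P, F, B=np.zeros((2, 1)), u=np.zeros(1), Q=0.01 * np.eye(2))
x, P = kf_update(x, P, z=np.array([1.2]), H=H, R=np.array([[0.1]]))
```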
DeepSORT is a multi-object tracking algorithm that utilizes both the appearance and position information of objects. After extracting the appearance feature vector $f_i$ of each tracked object through a neural network, the cosine similarity $\cos\theta$ between the appearance feature vector $f_i$ of the current frame and $f_j$ of the previous frame is calculated, as shown in Equation (3). The cosine similarity measures the similarity between two vectors $f_i$ and $f_j$ based on the angle between them, with a value closer to 1 indicating higher similarity.
$$\cos\theta = \frac{f_i \cdot f_j}{\lVert f_i \rVert \, \lVert f_j \rVert} \tag{3}$$
The appearance information obtained through the cosine distance is effective for recovering the IDs of objects that have been occluded for a long time [19]. DeepSORT calculates the similarity of motion information between the detection bounding box $d$ and the predicted bounding box $e$ using the Mahalanobis distance, as shown in Equation (4).
$$d_M(d, e) = (d - e)^{T} S^{-1} (d - e) \tag{4}$$
The Mahalanobis distance represents the distance between two vectors d and e, considering the data distribution using the covariance matrix S. A smaller value indicates higher similarity between the two vectors [16]. The final score is a combination of the cosine similarity and the Mahalanobis distance, using weights α and β to comprehensively evaluate the similarity between objects, as shown in Equation (5).
$$\mathrm{Final\ Score} = \alpha \cdot \cos\theta + \beta \cdot d_M \tag{5}$$
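A minimal sketch of Equations (3)–(5) follows; the weights α and β are free parameters chosen arbitrarily here, and note that the two terms run in opposite directions (a similarity and a distance), so in practice they are gated or rescaled before combination.

```python
import numpy as np

def cosine_similarity(fi, fj):
    """Equation (3): angle-based similarity of two appearance vectors."""
    return float(fi @ fj / (np.linalg.norm(fi) * np.linalg.norm(fj)))

def mahalanobis(d, e, S):
    """Equation (4): distance between detection d and prediction e under
    covariance S; a smaller value means higher motion similarity."""
    diff = d - e
    return float(diff @ np.linalg.inv(S) @ diff)

def final_score(fi, fj, d, e, S, alpha=0.7, beta=0.3):
    """Equation (5): weighted combination of appearance and motion cues.
    The raw sum mirrors the formula; since cosine similarity rewards a match
    while the distance penalizes it, a real tracker gates or sign-flips the
    distance term before combining."""
    return alpha * cosine_similarity(fi, fj) + beta * mahalanobis(d, e, S)
```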
The final results are obtained by assigning IDs to each object through the associated cost matrix using the Hungarian algorithm. The Hungarian algorithm finds the optimal matching between objects and tracks based on the cost matrix, with a time complexity of O(n³). Its steps are as follows, with a library-based sketch after the list:
  • Subtract the minimum value in each row to perform row minimization.
  • Subtract the minimum value in each column to perform column minimization.
  • Find the minimum number of lines that cover all zeros in the minimized matrix.
  • Find the smallest uncovered element and subtract this value from all uncovered elements, adding it to all covered elements.
  • Repeat this process to find the optimal matching.
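In practice, the reduction steps above are rarely hand-coded; here is a minimal sketch using scipy.optimize.linear_sum_assignment, an O(n³) implementation of the same optimal assignment, on a hypothetical track-to-detection cost matrix.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows = existing tracks, columns = new detections; entries are matching
# costs (e.g., 1 - cosine similarity, or a gated Mahalanobis distance).
cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.6],
                 [0.9, 0.8, 0.3]])

track_idx, det_idx = linear_sum_assignment(cost)   # optimal assignment
for t, d in zip(track_idx, det_idx):
    print(f"track {t} -> detection {d} (cost {cost[t, d]:.2f})")
```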
The Hungarian algorithm plays an important role in multi-object tracking by finding the optimal association between objects and tracks. This helps maintain consistent IDs for objects and improves the accuracy of tracking through optimal matching [16]. Through this process, the DeepSORT algorithm can effectively track objects. The Mahalanobis distance and the Hungarian algorithm play critical roles in maximizing the efficiency and accuracy of underwater fish detection and tracking by considering data distribution and optimal matching.

4. Implementation

4.1. Development Environment

In this study, the software environment was built on Python 3.10, with the PyTorch-based MMDetection API as the artificial intelligence library. The hardware environment consisted of Windows 11 as the OS, an Intel i9-9900K CPU, 128 GB of RAM, an NVIDIA Quadro RTX 6000 GPU (NVIDIA, Santa Clara, CA, USA), and a 1 TB Samsung M.2 SSD for storage; the detailed environment is given in Table 2.

4.2. Dataset

In this study, an integrated dataset for detecting and tracking black porgy (Acanthopagrus schlegelii) in underwater environments was constructed. The data were collected from the internet using underwater videos filmed in various settings, covering the five environments shown in Figure 5. The purpose of this dataset is to evaluate the performance of fish detection and tracking models under consistent and diverse conditions.
The dataset includes five different environments, each consisting of five videos. Each video is composed of a certain number of frames, ensuring various scenarios for evaluation [20]. The environments and related details are as follows: (a) clear water, natural habitat (5000 frames per video), (b) fish farms with artificial structures (4500 frames per video), (c) fish farm with lighting (5200 frames per video), (d) cloudy water, inside the fish farm (4800 frames per video), and (e) natural habitat rich in seaweed (5100 frames per video). Table 3 summarizes the dataset configuration for each environment.
To ensure a fair and consistent evaluation of the two FairMOT models, the dataset includes integrated annotations suitable for both detection and tracking tasks. The detection annotations provide bounding boxes around each fish in each frame, and the tracking annotations include unique IDs to track each fish across multiple frames. This integrated annotation approach allows both models to be trained and evaluated on the same dataset, ensuring consistency in evaluation criteria.
The dataset was preprocessed and labeled to support both detection and tracking tasks. It was then divided into training and testing sets for model training and performance evaluation. The training set consisted of a total of 7081 frames, and the testing set of 1300 frames. Table 4 summarizes the integrated dataset configuration and quantity for the training and testing data.
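As an illustration of how such annotations are consumed, here is a minimal loader sketch assuming a MOT Challenge-style CSV layout (frame, track ID, box coordinates), one of the formats DarkLabel can export; the column order depends on the export settings used, and the path below is hypothetical.

```python
import csv
from collections import defaultdict

def load_tracks(gt_path):
    """Group MOT-style annotation rows (frame, track_id, x, y, w, h, ...)
    by frame number. The exact column layout depends on how the labels
    were exported from DarkLabel."""
    frames = defaultdict(list)
    with open(gt_path, newline="") as f:
        for row in csv.reader(f):
            frame, track_id = int(row[0]), int(row[1])
            x, y, w, h = map(float, row[2:6])
            frames[frame].append({"id": track_id, "bbox": (x, y, w, h)})
    return frames

# tracks = load_tracks("train/env_a/gt/gt.txt")  # hypothetical path
```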
Specific metrics are used to evaluate model performance. Precision, recall, and mean average precision (mAP) are used to evaluate detection performance, while Multiple Object Tracking Accuracy (MOTA) and the ID F1 score are used to evaluate tracking performance. Using the same dataset and evaluation framework ensures a fair comparison between the two FairMOT backbone models, allowing a comprehensive evaluation of their ability to detect and track fish in underwater environments. This approach ensures that performance differences are due to the models themselves rather than variations in the dataset.

4.3. Evaluation Metrics

In this study, several metrics were used to evaluate the performance of the Multiple Object Tracking (MOT) model. The main evaluation metrics used are MOTA (Multiple Object Tracking Accuracy), MOTP (Multiple Object Tracking Precision), and IDF1 (Identification F1 score).
MOTA (Multiple Object Tracking Accuracy) is a metric that evaluates the accuracy of multi-object tracking by comprehensively considering the impacts of false negatives (FN), false positives (FP), and ID switches (IDSW). It measures the tracking accuracy of multiple targets in a single camera and is calculated using the following Formula (6):
$$\mathrm{MOTA} = 1 - \frac{\sum_{t}\left(FN_t + FP_t + IDSW_t\right)}{\sum_{t} GT_t} \tag{6}$$
Here, $FN_t$ represents the false negatives (undetected objects) at time $t$, $FP_t$ represents the false positives (incorrectly detected objects) at time $t$, $IDSW_t$ represents the ID switches (incorrect ID matching) at time $t$, and $GT_t$ represents the number of ground-truth objects at time $t$.
MOTP (Multiple Object Tracking Precision) is a metric that evaluates the precision of multi-object tracking based on the distance error of each object and is calculated using the following Formula (7):
$$\mathrm{MOTP} = \frac{\sum_{i,t} d_t^{\,i}}{\sum_{t} c_t} \tag{7}$$
Here, $d_t^i$ represents the distance error of object $i$ at time $t$, and $c_t$ represents the number of matched objects at time $t$. In this study, $d_t^i$ is measured via the overlap between the bounding box of target $i$ and the matched bounding box at frame $t$. An overlap within the 50–100% range is considered a correct match, and a higher MOTP value then indicates better location accuracy of the tracked objects.
IDF1 (Identification F1 score) represents the F-value for identifying the target ID in each frame and is calculated using the following Formula (8):
$$\mathrm{IDF1} = \frac{2 \cdot IDTP}{2 \cdot IDTP + IDFP + IDFN} \tag{8}$$
Here, IDTP (ID true positives) represents the number of correct object ID matches, IDFP (ID false positives) represents the number of incorrect object ID matches, and IDFN (ID false negatives) represents the number of missed object IDs. A higher IDF1 value indicates better consistency in maintaining object IDs, and this metric comprehensively evaluates the model’s performance in terms of detection accuracy and object identification consistency. Through these evaluation metrics, the performance of the proposed model in this study was comprehensively evaluated.
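The following is a direct transcription of Formulas (6)–(8) from per-frame counts; in practice, a library such as py-motmetrics performs the detection-to-ground-truth matching that produces these counts.

```python
def mota(fn, fp, idsw, gt):
    """Formula (6): from lists of per-frame FN, FP, IDSW, and GT counts."""
    return 1.0 - (sum(fn) + sum(fp) + sum(idsw)) / sum(gt)

def motp(dists, matches):
    """Formula (7): summed distance error over summed matched objects."""
    return sum(dists) / sum(matches)

def idf1(idtp, idfp, idfn):
    """Formula (8): F1 score over identity-level true/false positives."""
    return 2 * idtp / (2 * idtp + idfp + idfn)

# Toy per-frame counts for a 3-frame clip with 5 ground-truth objects each:
print(mota(fn=[1, 0, 0], fp=[0, 1, 0], idsw=[0, 0, 1], gt=[5, 5, 5]))  # 0.8
```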

4.4. Comparison of Training Loss between FairMOT Models DLA-34 and YOLOv5s

Figure 6 compares the training results of the two backbones of the FairMOT model, FairMOT + DLA-34 and FairMOT + YOLOv5s. Both models were trained with an input image size of 128 × 64, a batch size of 8, and 60 epochs. This graph shows the change in train loss during the training process to identify the point where the training loss is minimized.
For the FairMOT-DLA-34 model, the initial training loss value starts at 1.955 and decreases sharply in the early stages of training, dropping rapidly until around 10 epochs. After that, it shows a gradual decrease, finally recording a loss value of 0.1067 at 60 epochs. The FairMOT model with the DLA-34 backbone shows a relatively low initial loss value and converges to a stable loss value quickly during the training process.
On the other hand, the FairMOT-YOLOv5s model starts with an initial training loss value of 2.348 and decreases sharply in the early stages of training, dropping rapidly until around 20 epochs. After that, it shows a gradual decrease, finally recording a loss value of 0.8541 at 60 epochs. The FairMOT model with the YOLOv5s backbone shows a higher initial loss value, a relatively slower training speed, and a higher final loss value.
The graph visually shows the difference in training behavior between the two models. The FairMOT-DLA-34 model converges faster and reaches a lower final loss value than the FairMOT-YOLOv5s model, suggesting that the DLA-34 backbone extracts features more effectively and converges faster during training. Note, however, that a lower training loss does not by itself guarantee better test performance (see Section 4.5). The graph also helps determine the optimal number of epochs for model training, which helps prevent overfitting and maximize the generalization performance of the model.

4.5. Results

The test results of the FairMOT-DLA-34 model proposed in this paper showed an IDF1 of 11.0%, a MOTA of 44.1%, a MOTP of 0.393, and an IDSW of 1, as shown in Table 5. The test results of the FairMOT-YOLOv5s model showed an IDF1 of 14.6%, a MOTA of 43.8%, a MOTP of 0.400, and an IDSW of 10. Comparing the two models, the FairMOT-DLA-34 model achieved a higher MOTA and far fewer ID switches (IDSW), while the FairMOT-YOLOv5s model achieved higher IDF1 and MOTP scores.
Figure 7 and Figure 8 show the test results using FairMOT-DLA-34 and FairMOT-YOLOv5s models. Figure 7 compares the fish detection results of the two models in the same frame. The FairMOT-YOLOv5s model accurately detected the fish in the 1290th frame, whereas the FairMOT-DLA-34 model failed to detect the fish in the same frame. This suggests that the model using the YOLOv5s backbone has higher detection accuracy.
Figure 8 compares the fish ID tracking results of the two models in the same frame [19,21]. The FairMOT-YOLOv5s model accurately tracked the fish ID until the 1137th frame, whereas the FairMOT-DLA-34 model failed in ID tracking in the same frame. This shows that the model using the YOLOv5s backbone has superior performance in ID tracking as well.
In conclusion, the FairMOT-DLA-34 model achieved higher overall tracking accuracy (MOTA) and fewer ID switch errors, whereas the FairMOT-YOLOv5s model performed better in per-frame detection and in maintaining consistent object identities (higher IDF1), as illustrated in Figures 7 and 8. Therefore, this paper suggests that the FairMOT-YOLOv5s model is more suitable when identity consistency and detection accuracy are the priority.

5. Conclusions

In this study, the FairMOT model was used to detect and track underwater fish using Multiple Object Tracking (MOT) technology. The performance of two backbone models, DLA-34 and YOLOv5s, was evaluated and compared.
The experimental results showed that the FairMOT-DLA-34 model achieved a higher MOTA and far fewer ID switches (IDSW), while the FairMOT-YOLOv5s model achieved higher IDF1 and MOTP scores. In the qualitative comparisons, the FairMOT-YOLOv5s model showed superior performance in per-frame detection and in maintaining fish IDs across frames.
Comparing the performance of the two models through video samples, the FairMOT-YOLOv5s model showed better performance in accurately detecting and tracking fish IDs in the same frame. In contrast, the FairMOT-DLA-34 model showed relatively lower performance in fish detection and ID tracking in specific frames.
Through this study, the strengths and weaknesses of the two backbone models of the FairMOT model for underwater fish detection and tracking were clearly identified. The FairMOT-YOLOv5s model is more suitable when identity consistency (IDF1) and detection accuracy are important, while the FairMOT-DLA-34 model may be advantageous when overall tracking accuracy (MOTA) and a low number of ID switches are the priority.
Future research can contribute to the development of more efficient and accurate underwater fish detection and tracking systems by conducting additional experiments in various environments and comparing them with other backbone models.

Author Contributions

S.-H.L. and M.-H.O.: writing—original draft, data curation, software, and visualization. S.-H.L.: writing—review and editing. M.-H.O.: conceptualization, validation, writing—review and editing, and project administration. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2022-0-01000, Development of 5G Edge Computing SW for Flexible Healthcare to support Mobile Customized Medical Services).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data, models, and codes generated or used during the study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Vianna, G.M.; Zeller, D.; Pauly, D. Fisheries and Policy Implications for Human Nutrition. Curr. Environ. Health Rep. 2020, 7, 161–169. [Google Scholar] [CrossRef] [PubMed]
  2. Hilborn, R.; Amoroso, R.O.; Anderson, C.M.; Baum, J.K.; Branch, T.A.; Costello, C.; de Moor, C.L.; Faraj, A.; Hively, D.; Jensen, O.P.; et al. Effective fisheries management instrumental in improving fish stock status. Proc. Natl. Acad. Sci. USA 2020, 117, 2218–2224. [Google Scholar] [CrossRef] [PubMed]
  3. Food and Agriculture Organization of the United Nations. Available online: https://www.fao.org/fishery/en/countrysector/kr/en (accessed on 10 June 2023).
  4. Johnson, J.; Bertram, I.; Moore, B.R.; Welch, D.J.; Williams, A.; Bell, J.; Govan, H. Effects of Climate Change on Fish and Shellfish Relevant to Pacific Islands, and the Coastal Fisheries they Support What is Already Happening? PACIFIC Mar. Clim. Chang. Rep. CARD Sci. Rev. 2018, iii, 74–98. Available online: https://www.ian.umces.edu/symbols/ (accessed on 15 June 2023).
  5. Korean Statistical Information Service. Available online: http://kostat.go.kr/portal/korea/kor_nw/1/1/index.board?bmode=read&aSeq=381312 (accessed on 15 May 2023).
  6. Cui, Z.; Wu, J.F.; Yu, H. A Review of the Application of Computer Vision Technology in Aquaculture. Mar. Sci. Bull. 2018, 20, 53–66. [Google Scholar]
  7. Chuang, M.C.; Hwang, J.N.; Williams, K. A Feature Learning and Object Recognition Framework for Underwater Fish Images. IEEE Trans. Image Process. 2016, 25, 1862–1872. [Google Scholar] [CrossRef] [PubMed]
  8. Sadeghian, A.; Alahi, A.; Savarese, S. Tracking the Untrackable: Learning to Track Multiple Cues with Long-Term Dependencies. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 300–311. [Google Scholar] [CrossRef]
  9. Wang, Z.; Zheng, L.; Liu, Y.; Li, Y.; Wang, S. Towards Real-Time Multi-Object Tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 107–122. [Google Scholar]
  10. Yu, F.; Wang, D.; Shelhamer, E.; Darrell, T. Deep Layer Aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2403–2412. [Google Scholar] [CrossRef]
  11. Kandimalla, V.; Richard, M.; Smith, F.; Quirion, J.; Torgo, L.; Whidden, C. Automated Detection, Classification and Counting of Fish in Fish Passages with Deep Learning. Front. Mar. Sci. 2022, 8, 823173. [Google Scholar] [CrossRef]
  12. Jang, E.; Lee, S.J.; Jo, H. A New Multimodal Map Building Method Using Multiple Object Tracking and Gaussian Process Regression. Remote Sens. 2024, 16, 2622. [Google Scholar] [CrossRef]
  13. Yu, P.; Yan, Y.; Tang, X.; Shang, Y.; Su, H. A Lightweight CER-YOLOv5s Algorithm for Detection of Construction Vehicles at Power Transmission Lines. Appl. Sci. 2024, 14, 6662. [Google Scholar] [CrossRef]
  14. Li, X.; Lai, T.; Wang, S.; Chen, Q.; Yang, C.; Chen, R. Weighted Feature Pyramid Networks for Object Detection. In Proceedings of the 2019 IEEE International Conference on Parallel and Distributed Processing with Applications, Big Data and Cloud Computing, Sustainable Computing and Communications, Social Computing and Networking (ISPA/BDCloud/SustainCom/SocialCom), Xiamen, China, 16–18 December 2019; pp. 1500–1504. [Google Scholar] [CrossRef]
  15. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar] [CrossRef]
  16. Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking. Int. J. Comput. Vis. 2021, 129, 3069–3087. [Google Scholar] [CrossRef]
  17. Milan, A.; Leal-Taixe, L.; Reid, I.; Roth, S.; Schindler, K. MOT16: A Benchmark for Multi-Object Tracking. arXiv 2016. Available online: http://arxiv.org/abs/1603.00831 (accessed on 1 August 2023).
  18. Bernardin, K.; Stiefelhagen, R. Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics. Eurasip J. Image Video Process. 2008, 2008, 1–10. [Google Scholar] [CrossRef]
  19. Wojke, N.; Bewley, A.; Paulus, D. Simple Online and Realtime Tracking with a Deep Association Metric. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar] [CrossRef]
  20. Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance Measures and a Data Set for Multi-Target, Multi-Camera Tracking. In Lecture Notes in Computer Science; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; Volume 9914, pp. 17–35. [Google Scholar] [CrossRef]
  21. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple Online and Realtime Tracking. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar] [CrossRef]
Figure 1. Structure of DLA-34 backbone.
Figure 2. Structure of YOLOv5s backbone.
Figure 3. Structure of FairMOT network.
Figure 4. Process of the FairMOT mixed model.
Figure 5. Data extraction in five environments.
Figure 6. Comparison of training loss between FairMOT models DLA-34 and YOLOv5s.
Figure 7. Comparison of test results between FairMOT-DLA-34 and FairMOT-YOLOv5s.
Figure 8. ID detection comparison for FairMOT-DLA-34 and FairMOT-YOLOv5s.
Table 1. Status of aquaculture workers.

Year              | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019
Number of Workers | 6236 | 5816 | 5760 | 5637 | 5550 | 5438 | 5408 | 5635 | 5524
Table 2. Configuration of development environment.

Division                      | Specification
Operating system (OS)         | Windows 11
Central processing unit (CPU) | Intel i9-9900K (Santa Clara, CA, USA)
GPU                           | NVIDIA Quadro RTX 6000
Memory                        | 128 GB
Storage                       | Samsung M.2 1 TB (Suwon, Republic of Korea)
Table 3. Dataset configuration by environment.

Environment Description                   | Number of Videos | Number of Frames
(a) Clear water, natural habitat          | 5                | 5000
(b) Fish farms with artificial structures | 5                | 4500
(c) Fish farm with lighting               | 5                | 5200
(d) Cloudy water, inside the fish farm    | 5                | 4800
(e) Natural habitat rich in seaweed       | 5                | 5100
Table 4. Dataset configuration and quantity.

Division                                  | Train Data | Test Data
(a) Clear water, natural habitat          | 528        | 1300
(b) Fish farms with artificial structures | 344        | –
(c) Fish farm with lighting               | 181        | –
(d) Cloudy water, inside the fish farm    | 4737       | –
(e) Natural habitat rich in seaweed       | 1291       | –
Total                                     | 7081       | 1300
Table 5. Test results of FairMOT.

Model           | IDF1  | MOTA  | MOTP  | IDSW
FairMOT-DLA-34  | 11.0% | 44.1% | 0.393 | 1
FairMOT-YOLOv5s | 14.6% | 43.8% | 0.400 | 10