1. Introduction
In recent years, there has been a surge in maritime accidents worldwide, resulting in significant human and economic losses. Since 2014, the number of maritime accidents has steadily increased, averaging 2647 casualties and incidents annually [1]. Maritime search and rescue (SAR) operations play a vital role in national emergency response systems, with the primary challenge being swiftly and accurately locating and identifying objects at sea. The emergence of unmanned aerial vehicle (UAV) technology has revolutionized various fields, including robotics, security surveillance, intelligent transportation, wildlife conservation, and geospatial information [2,3,4,5,6,7]. Owing to their agility, portability, and aerial accessibility, UAVs offer rapid deployment, high data capacity, and outstanding spatial resolution, making them particularly effective for executing maritime SAR missions [8].
However, identifying individuals in maritime distress within UAV imagery presents unique challenges due to their small scale. Operator fatigue and distractions can also lead to missed sightings of man-overboard situations in aerial images. The application of deep learning-based object detection algorithms has significantly improved this situation. UAVs equipped with such algorithms can rapidly cover extensive maritime areas while promptly identifying individuals in distress, ensuring that they can be rescued faster, thereby increasing the probability of survival. Nonetheless, UAV image object detectors must strike a balance between lightweight design and real-time processing due to hardware limitations and specific application scenarios. This study focuses on enhancing the efficiency of identifying man-overboard situations in UAV imagery by addressing these challenges.
Several scholars have demonstrated the exceptional performance of the You Only Look Once (YOLO) series in the field of object detection, particularly in UAV-based detection systems [9,10,11]. Therefore, we have selected YOLOv7-tiny [12] as the UAV visual framework for maritime search and rescue operations involving man-overboard situations.
The primary contributions of this work are as follows:
The introduction of a lightweight object detector, “YOLOv7-FSB,” tailored to efficiently identify man-overboard situations for rescue missions.
To address the challenge of small objects in images, we propose a lightweight backbone network named “FSNet” to enhance the model’s global information perception and improve its robustness.
In response to the limited feature information of submerged individuals in aerial images, we introduce the module design called “SP-ELAN.” This module integrates channel reconstruction units, reducing feature redundancy and enhancing algorithm efficiency.
To enhance the utilization efficiency of feature information for submerged individuals in aerial images, we implement an improved Bidirectional Feature Pyramid Network. This network performs a bidirectional fusion of features related to submerged individuals in aerial images, integrating both local and global features.
The selection of ByteTrack as the tracking component, combined with the improved detection algorithm, demonstrates real-time detection and tracking capabilities through experimentation.
The remainder of this paper is organized as follows: Section 2 reviews related research, Section 3 introduces the algorithm for identifying maritime man-overboard situations based on aerial imagery, and detailed information about the experiments and analysis is provided in Section 4. In Section 5, conclusions are drawn, and future research directions are outlined.
3. Materials and Methods
YOLOv7-tiny, like other detectors in the YOLO series, features a structured architecture comprising a backbone, neck, and head. The structure of YOLOv7-tiny is illustrated in
Figure 1.
The backbone of YOLOv7-tiny includes multiple CBL modules, ELAN structures, and MP convolution layers. Each CBL module consists of convolution layers, batch normalization layers, and an activation function. YOLOv7-tiny employs LeakyReLU as its activation function, which is an evolution from the ReLU (Rectified Linear Unit) activation function. The ELAN structure enhances the network’s learning capacity by controlling the shortest and longest gradient paths, effectively extracting features. The downsampling structure is designed with MP convolution layers, incorporating both max-pooling and convolution layers to facilitate parallel feature extraction and compression.
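For reference, the two activation functions mentioned above can be written in their standard forms (the negative slope α is a small constant, commonly 0.1 or 0.01; this is the textbook definition, not an equation reproduced from this paper):

\[
\mathrm{ReLU}(x) = \max(0, x), \qquad \mathrm{LeakyReLU}(x) = \max(\alpha x,\, x), \quad 0 < \alpha < 1
\]

Unlike ReLU, LeakyReLU keeps a small non-zero gradient for negative inputs, which helps avoid inactive ("dead") units during training.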
The neck of YOLOv7-tiny adheres to the PANet structure used in YOLOv5. It upsamples and downsamples features of different scales obtained from the backbone, enabling feature fusion and enhancing the model’s ability to capture information across scales. The neck comprises modules such as SPPCSPC and ELAN-Y. The SPPCSPC structure connects to the final layer of the backbone, introducing a substantial residual branch to optimize feature extraction, reduce computational load, and expand the receptive field.
The head module of YOLOv7-tiny encompasses detection heads operating at various scales, including large, medium, and small dimensions. These heads serve as the network’s classifier and regressor, managing classification and regression tasks for object detection. Ultimately, this module achieves the dual objectives of object classification and precise localization in the context of object detection.
YOLOv7 supports several bounding-box regression losses, including IoU, GIoU, DIoU, and CIoU. Plain IoU is well suited to training sets with few samples, but given the abundant number of images in our dataset, it does not fit our circumstances. CIoU outperforms GIoU and DIoU by considering geometric parameters such as the overlapping area, center-point distance, and aspect ratio, and it effectively addresses inaccurate and slow convergence. Hence, we employ CIoU as the model’s loss function to ensure accurate and rapid convergence.
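For reference, the CIoU loss in its commonly used form (a standard definition consistent with the description above, not an equation reproduced from this paper) is:

\[
\mathcal{L}_{\mathrm{CIoU}} = 1 - \mathrm{IoU} + \frac{\rho^{2}(b,\, b^{gt})}{c^{2}} + \alpha v, \qquad
v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}, \qquad
\alpha = \frac{v}{(1 - \mathrm{IoU}) + v}
\]

where \(b\) and \(b^{gt}\) denote the centers of the predicted and ground-truth boxes, \(\rho(\cdot)\) is the Euclidean distance, and \(c\) is the diagonal length of the smallest box enclosing both. The three terms correspond directly to the overlapping area, center-point distance, and aspect-ratio consistency mentioned above.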
Building upon this model, we introduce YOLOv7-FSB, tailored for real-time detection and tracking of individuals in maritime distress within aerial imagery captured by UAVs.
3.1. Improvement-Based FSNet
In aerial images captured by drones, the targets are often small, placing higher demands on the backbone network of a model. However, the onboard computational resources are limited, necessitating a lightweight yet powerful feature extraction backbone network. Therefore, we propose an innovative lightweight backbone network named FSNet, specifically designed for efficiently extracting features relevant to individuals in distress in aerial images captured by unmanned aerial vehicles.
FasterNet [36], a crucial component of FSNet, comprises an inverted residual block that includes a partial convolution (PConv) layer, two 1 × 1 convolution layers, batch normalization, and ReLU activation. PConv has proven to be highly effective in extracting spatial features of individuals in maritime distress within aerial imagery. It not only reduces redundant computations but also minimizes memory access. Traditionally, lightweight model design focused on reducing floating-point operations (FLOPs). However, it was observed that simply reducing FLOPs did not significantly improve speed, as memory access often became the bottleneck. The introduction of PConv effectively addresses this issue.
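As an illustration of the PConv idea described above, the following is a minimal PyTorch sketch, assuming a 3 × 3 kernel and a convolved-channel ratio of 1/4 (the FasterNet default); it is not the implementation used in this paper:

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """Minimal sketch of partial convolution (PConv) as described for FasterNet [36]:
    a regular 3x3 convolution is applied to only a fraction of the input channels,
    while the remaining channels pass through untouched, reducing both redundant
    computation and memory access."""

    def __init__(self, channels: int, ratio: float = 0.25):
        super().__init__()
        self.conv_channels = int(channels * ratio)           # channels that are convolved
        self.pass_channels = channels - self.conv_channels   # channels left untouched
        self.conv = nn.Conv2d(self.conv_channels, self.conv_channels,
                              kernel_size=3, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_conv, x_pass = torch.split(x, [self.conv_channels, self.pass_channels], dim=1)
        return torch.cat((self.conv(x_conv), x_pass), dim=1)
```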
The Simple Attention Module (SimAM) [37], incorporated into FSNet, introduces three-dimensional attention weights that consider both spatial and channel dimensions. Unlike previous attention modules, SimAM excels in capturing both channel and spatial features, offering enhanced flexibility and modularity without the need for additional parameters in the base network.
As illustrated in
Figure 2, each FSNetBlock in FSNet cleverly integrates FasterNet with the SimAM attention module. Within an FSNetBlock, a combination of 3 × 3 PConv and 1 × 1 Conv is employed, where the use of 1 × 1 convolutional layers reduces the parameter count, accelerates the training speed, and enhances the model’s nonlinear fitting capability. However, these layers exhibit a limited receptive field, impeding the acquisition of global features. Leveraging the lightweight attention mechanism of the SimAM module addresses this issue, resulting in an improved receptive field for the model. The FSNet lightweight backbone network is ultimately composed of multiple FSNetBlocks. This design significantly enhances the extraction of relevant features for individuals in distress at sea, enabling FSNet to efficiently and rapidly extract these crucial features from aerial images.
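To make the block structure concrete, here is a hypothetical PyTorch sketch of SimAM (in its standard parameter-free form) and of an FSNetBlock that chains a 3 × 3 PConv, 1 × 1 convolutions, and SimAM. The exact ordering, expansion ratio, and residual wiring are assumptions on our part; `PartialConv` refers to the sketch above, and Figure 2 shows the actual layout.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free SimAM attention [37]: per-pixel 3-D weights derived from an
    energy function over each channel's spatial statistics (standard published form)."""

    def __init__(self, e_lambda: float = 1e-4):
        super().__init__()
        self.e_lambda = e_lambda

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, _, h, w = x.shape
        n = h * w - 1
        d = (x - x.mean(dim=[2, 3], keepdim=True)).pow(2)       # (x - mu)^2 per pixel
        v = d.sum(dim=[2, 3], keepdim=True) / n                 # channel-wise variance
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5             # inverse energy per pixel
        return x * torch.sigmoid(e_inv)                          # re-weight the features


class FSNetBlock(nn.Module):
    """Hypothetical sketch of one FSNetBlock: a FasterNet-style inverted residual
    (3x3 PConv followed by two 1x1 convolutions) with SimAM appended."""

    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.pconv = PartialConv(channels)                       # sketch from Section 3.1
        self.pw = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),
        )
        self.attn = SimAM()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.attn(x + self.pw(self.pconv(x)))             # residual, then SimAM
```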
FSNet offers several advantages, including reduced parameters, lower computational requirements, and superior feature extraction efficiency. These attributes are particularly relevant for real-time feature extraction in the context of maritime distress detection using UAV-captured aerial imagery.
3.2. Improvement-Based SP-ELAN
When detecting individuals in aerial images, the submerged person is often only partially visible above the water surface, limiting the available features in the image. This places greater demands on the feature extraction capabilities of the model. To enhance feature extraction in computer vision applications within the context of aerial image analysis, we drew inspiration from ScConv [38] and PConv, leading to the development of the SP-ELAN module.
ScConv, following a split–transform–merge strategy, extracts features from multiple parallel branches with distinct roles and concatenates these outputs for the final result. It divides the input into two parts, each processed by branches dedicated to extracting diverse types of contextual information. One branch involves the adaptive calibration of input features through convolution filters, facilitating communication between the filters, while the other branch preserves the original spatial context. PConv, on the other hand, judiciously applies regular convolution to a subset of input channels, leaving the rest untouched. This approach optimizes computational resources and enhances the capability for extracting spatial features from aerial imagery.
As depicted in
Figure 3, the SP-ELAN module leverages the advantages of ScConv and PConv. Specifically, we strategically replace certain Convs in the original ELAN with PConv, and the fused results undergo adaptive calibration through ScConv to yield the final output. This approach effectively integrates self-calibration with original spatial context information, generating highly discriminative output content while simplifying parameters. Such integration significantly enhances the precision of feature extraction, positioning it as a valuable complement to computer vision applications.
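The wiring can be sketched roughly as follows. This is a schematic under strong assumptions: the branch count and channel split are illustrative, `PartialConv` is the sketch from Section 3.1, and the calibration stage is a simple channel-gating placeholder standing in for ScConv, not the ScConv used in the paper.

```python
import torch
import torch.nn as nn

class SPELAN(nn.Module):
    """Rough sketch of the SP-ELAN wiring only: ELAN-style parallel branches in which
    some 3x3 convolutions are replaced by PConv, outputs concatenated, and the fused
    result passed through a calibration stage (placeholder for ScConv [38])."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        mid = in_ch // 2
        self.branch1 = nn.Conv2d(in_ch, mid, 1, bias=False)   # shortcut branch
        self.branch2 = nn.Conv2d(in_ch, mid, 1, bias=False)   # main branch entry
        self.pconv1 = PartialConv(mid)                          # replaces a 3x3 Conv
        self.pconv2 = PartialConv(mid)                          # replaces a 3x3 Conv
        self.fuse = nn.Conv2d(mid * 4, out_ch, 1, bias=False)
        # Placeholder calibration: a channel gate; the paper uses ScConv here.
        self.calibrate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b1 = self.branch1(x)
        b2 = self.branch2(x)
        b3 = self.pconv1(b2)
        b4 = self.pconv2(b3)
        y = self.fuse(torch.cat((b1, b2, b3, b4), dim=1))
        return y * self.calibrate(y)   # adaptive re-calibration of the fused features
```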
3.3. Improvement-Based BiFPN-S
As a critical component bridging the gap between the backbone and the head, the neck plays a pivotal role in processing and amalgamating features extracted from the main trunk to better suit the requirements of object detection tasks. While the Feature Pyramid Network (FPN) has been a fundamental element in recognizing objects of varying sizes, its traditional top–down structure is limited by the unidirectional flow of information. To address this limitation, the Path Aggregation Network (PAN) introduced a bottom–up aggregation path, which improved accuracy but added complexity in terms of parameters and computational requirements.
In aerial images, the feature information of man-overboard situations holds particular significance. Traditional methods of feature fusion face challenges such as limitations in unidirectional information flow and high search costs. To establish a lightweight feature pyramid that strikes a balance between efficiency and accuracy, this paper introduces an enhanced version of the Bidirectional Feature Pyramid Network (BiFPN) [28], referred to as BiFPN-S. The original BiFPN enhances pathways through efficient bidirectional cross-scale connections and weighted feature fusion. However, higher-level feature layers offer limited assistance in detecting small targets in aerial images. To eliminate redundant feature extraction calculations without compromising the model’s ability to extract features related to man-overboard situations, we propose BiFPN-S.
In BiFPN-S, fusion is conducted using the fast normalized weighted fusion of BiFPN:

\[
O = \sum_{i} \frac{w_i}{\epsilon + \sum_{j} w_j}\, I_i
\]

Here, a ReLU activation function is added after each weight \(w_i\) to ensure \(w_i \ge 0\), and \(\epsilon = 0.0001\) is introduced to avoid numerical instability. Taking the fifth layer as an example:

\[
P_5^{td} = \mathrm{Conv}\!\left(\frac{w_1 P_5^{in} + w_2\,\mathrm{Resize}(P_6^{in})}{w_1 + w_2 + \epsilon}\right), \qquad
P_5^{out} = \mathrm{Conv}\!\left(\frac{w_1' P_5^{in} + w_2' P_5^{td} + w_3'\,\mathrm{Resize}(P_4^{out})}{w_1' + w_2' + w_3' + \epsilon}\right)
\]

where \(P_5^{td}\) represents the intermediate feature of level 5 in the top–down path, and \(P_5^{out}\) is the output feature of level 5 in the bottom–up path. In this context, \(\mathrm{Conv}\) denotes depthwise separable convolution, with batch normalization and activation functions added after each convolution, and \(\mathrm{Resize}\) denotes upsampling or downsampling to match resolutions. In our optimized version, we reduce the number of feature extraction layers, specifically eliminating the highest-level feature extraction layer, resulting in a lightweight model, as depicted in
Figure 4. BiFPN-S provides an effective solution to enhance feature extraction in drone imagery without introducing unnecessary complexity.
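A single fusion node of this kind can be sketched in PyTorch as follows. This is a minimal illustration of the fast normalized fusion above, assuming the input features have already been resized to a common shape; it is not the exact BiFPN-S implementation.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Sketch of one BiFPN-style fusion node as used in BiFPN-S: learnable per-input
    weights passed through ReLU, normalized with a small epsilon, followed by a
    depthwise separable convolution, batch normalization, and activation."""

    def __init__(self, num_inputs: int, channels: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps
        self.dwconv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),  # depthwise
            nn.Conv2d(channels, channels, 1, bias=False),                              # pointwise
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, inputs: list[torch.Tensor]) -> torch.Tensor:
        w = torch.relu(self.weights)                 # keep the fusion weights non-negative
        w = w / (w.sum() + self.eps)                 # fast normalized fusion
        fused = sum(wi * xi for wi, xi in zip(w, inputs))
        return self.dwconv(fused)
```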
3.4. Tracking Model
Building upon our well-optimized object detection network, the next crucial step involves selecting an appropriate tracking method. In our evaluation, ByteTrack stands out as an exceptional choice, demonstrating outstanding performance and offering a streamlined solution for practical applications.
ByteTrack’s approach places a strong emphasis on low-scoring detection boxes. It efficiently reassigns these low-scoring boxes for matching with previously unassociated tracking trajectories once the high-scoring boxes have been matched. Furthermore, when ByteTrack encounters detection boxes with sufficiently high scores but cannot find a match, it initiates the creation of new tracking trajectories. During data matching, ByteTrack has a low dependency on ReID features for appearance similarity calculations. This is a crucial advantage, particularly in scenarios such as aerial images depicting individuals in water rescue situations, where the features available for identifying distressed individuals are notably limited.
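The two-stage association described above can be summarized with the following schematic sketch. The score thresholds are illustrative assumptions rather than the values used in this paper, and `match_fn` stands in for an IoU-based Hungarian matcher; this is not ByteTrack's actual code.

```python
def bytetrack_step(tracks, detections, match_fn, high_thr=0.6, new_track_thr=0.7):
    """Simplified sketch of one ByteTrack association step. `match_fn(tracks, dets)`
    is assumed to return (matched_pairs, unmatched_tracks, unmatched_detections)."""
    # Split detections by confidence score.
    high = [d for d in detections if d.score >= high_thr]
    low = [d for d in detections if d.score < high_thr]

    # 1) First association: high-score detections vs. all existing tracks.
    matched, unmatched_tracks, unmatched_high = match_fn(tracks, high)

    # 2) Second association: low-score detections vs. tracks that are still unmatched,
    #    recovering objects whose confidence dropped (e.g., partial occlusion by waves).
    matched_low, still_unmatched_tracks, _ = match_fn(unmatched_tracks, low)

    # 3) Unmatched detections with sufficiently high scores initialize new trajectories.
    new_tracks = [d for d in unmatched_high if d.score >= new_track_thr]

    return matched + matched_low, still_unmatched_tracks, new_tracks
```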
Considering the specific challenges posed by water rescue scenarios in aerial images and the emphasis on efficient tracking, we have chosen ByteTrack as our tracking model. It aligns well with our objectives, and as a tracking-by-detection method, ByteTrack’s tracking effectiveness is highly contingent on the detector’s performance. When the detector performs well, it yields favorable tracking results.
Hereafter, we refer to the method combining FSNet, SP-ELAN, and BiFPN-S, named after their initials, as ‘YOLOv7-FSB.’ In summary, to achieve efficient and effective detection and tracking of individuals in water rescue scenarios, we leverage the optimized object detection algorithm presented in this paper and integrate YOLOv7-FSB with ByteTrack to accomplish the tracking task. The collaboration between these two components results in a lightweight and efficient model, well suited to the demanding requirements of water rescue scenarios.
4. Experiments
4.1. Dataset
To validate the enhanced performance of YOLOv7-FSB, we conducted experiments using a meticulously curated dataset that combines selections from the MOBDrone [39] and SeaDronesSee [40] datasets.
The MOBDrone dataset comprises 49 videos captured with a DJI FC6310 camera mounted on a Phantom 4 Pro V2 drone. These videos portray various scenarios simulating individuals falling into water, encompassing both conscious and unconscious individuals, as well as other objects. The footage has been post-processed to a resolution of 1080p, and professional annotators manually labeled bounding boxes for objects in five categories: person, boat, surfboard, wood, and lifebuoy. The SeaDronesSee dataset showcases a diverse range of situations, with altitudes spanning from 5 m to 260 m and viewing angles varying from 0° to 90°; each frame is accompanied by the pertinent altitude, angle, and other metadata. This dataset is captured using multiple cameras and includes annotations covering various categories, such as swimmers, boats, jet skis, life-saving equipment, and buoys.
MOBDrone focuses on individuals in maritime man-overboard situations who are not wearing life jackets, whereas SeaDronesSee encompasses a wide range of scenarios related to an entire rescue process. These datasets have been amalgamated and processed for joint utilization in validating the proposed methodology. By combining the scenarios from MOBDrone and the diverse conditions from SeaDronesSee, we achieve a more comprehensive assessment of the methodology’s performance in maritime search and rescue scenarios.
4.2. Experimental Setup
This research was conducted using the Linux operating system. The configuration included an Intel(R) Xeon(R) Gold 6338 CPU @ 2.00 GHz with a minimum clock frequency of 0.8 GHz, complemented by 512 GB of memory. We harnessed the power of the NVIDIA A100-PCIE-40GB graphics processing unit (GPU) with 40 GB of memory capacity. To leverage GPU acceleration, the system ran on CUDA 11.7, and we primarily employed PyTorch 2.0.1 as the deep learning framework.
The interplay between software and hardware is crucial for model performance, and the hyperparameter settings are equally important in our experiments. Image size significantly affects both accuracy and speed. The learning rate determines the step size of each parameter update and must be tuned to the specific problem and model at hand, and its decay schedule is adjusted to facilitate convergence. The batch size is set to make full use of available memory, and the number of data-loading workers is increased to speed up data preparation while avoiding memory leaks. Finally, the number of training epochs is chosen according to the task and the complexity of the model. The experimental settings are summarized in
Table 1.
4.3. Evaluation Metrics
Confusion matrices are fundamental tools in the assessment of deep learning models, particularly in the context of computer vision. They offer a comprehensive means of evaluating a model’s performance by quantifying correct and incorrect predictions for each class.
Typically, a confusion matrix comprises four quadrants: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). True positives represent instances where the model accurately predicts positive cases. False positives refer to cases where the model incorrectly classifies negatives as positives. True negatives represent accurate predictions of negatives, while false negatives signify incorrect classifications of positives as negatives.
To provide a more comprehensive and intuitive evaluation of target detection and tracking algorithms, advanced evaluation metrics have been developed based on the foundation established by confusion matrices. These metrics significantly enhance our ability to understand and assess the performance of these algorithms.
4.3.1. Detection Evaluation Metrics
The evaluation of object detection algorithms hinges on two critical aspects: accurate object localization and correct object classification. In the context of deep learning for maritime target detection and tracking, evaluation metrics play a pivotal role in assessing the performance of detection models. IoU is a simple measure, but it focuses solely on the overlap area, disregarding object size and shape. Accuracy is easy to comprehend, yet it can be misleading on unbalanced datasets. Precision emphasizes correct positive predictions but ignores missed detections (false negatives), while recall prioritizes detecting all positives but ignores false alarms (false positives), making each metric less suitable on its own for unbalanced data. AP measures single-class accuracy, and mean average precision (mAP) averages AP over multiple classes.
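For completeness, the standard definitions underlying these metrics are:

\[
\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}
\]
\[
\mathrm{AP} = \int_{0}^{1} P(R)\, dR, \qquad
\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i
\]

where \(A\) and \(B\) are the predicted and ground-truth boxes, \(P(R)\) is the precision–recall curve, and \(N\) is the number of classes.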
The primary evaluation metrics based on these criteria are presented in
Table 2. In the “Note” column of the table, “↑” indicates that a higher value corresponds to better model performance, while “↓” indicates that a lower value signifies better performance. “Perfect” denotes the theoretical value for optimal performance.
4.3.2. Tracking Evaluation Metrics
Evaluating the performance of object tracking algorithms requires consideration of two fundamental principles. Firstly, it involves a thorough examination of the algorithm’s ability to accurately locate target positions. Secondly, it evaluates the algorithm’s effectiveness in maintaining the individual identities of each target over time. Based on these principles, several advanced evaluation metrics have been developed to provide a comprehensive assessment.
Bernardin and Stiefelhagen introduced CLEAR MOT [41], which serves as a metric for measuring the tracking model’s localization accuracy and association matching capabilities. Lisanti et al. proposed the ID Score [42], focusing on the stability and durability of the tracker’s object tracking.
Table 3 summarizes these evaluation metrics.
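In their standard forms, the two principal metrics are:

\[
\mathrm{MOTA} = 1 - \frac{\sum_{t}\left(FN_t + FP_t + IDSW_t\right)}{\sum_{t} GT_t}, \qquad
\mathrm{IDF1} = \frac{2\, IDTP}{2\, IDTP + IDFP + IDFN}
\]

where \(FN_t\), \(FP_t\), and \(IDSW_t\) are the false negatives, false positives, and identity switches in frame \(t\), \(GT_t\) is the number of ground-truth objects in frame \(t\), and IDTP, IDFP, and IDFN are identity-level true positives, false positives, and false negatives.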
4.4. Detection Results Discussion
4.4.1. Activation Function Comparison
We conducted a performance comparison of YOLOv7-tiny using two different activation functions: LeakyReLU and SiLU. SiLU is the default activation function used in YOLOv5 and YOLOv7, and it is recognized as an improved version of ReLU with a smoother behavior near zero. The experimental results, as shown in
Table 4, reveal that LeakyReLU outperforms SiLU as an activation function. Based on these results, we have selected LeakyReLU as the activation function for our model.
4.4.2. Comparison of FasterNet and FSNet Performance
To validate the effectiveness of FSNet, we conducted a performance comparison between FSNet and FasterNet as backbone network models. The experimental results, as presented in
Table 5, demonstrate that FSNet, being an enhanced version of FasterNet, excels in the detection of individuals in maritime distress in aerial imagery. This superiority is attributed to FSNet’s reduction in redundant computations and memory access while maintaining efficient computational processes and excellent feature extraction capabilities.
4.4.3. Ablation Test Comparison
To analyze the effectiveness of the different methods, a series of ablation experiments was designed. Each set of experiments employed the same dataset, training parameters, and training procedure. YOLOv7-tiny served as the baseline. Subsequent experiments introduced the following enhancements: YOLOv7-A integrated FSNet, YOLOv7-B used SP-ELAN to improve accuracy, YOLOv7-C incorporated BiFPN-S, YOLOv7-D combined FSNet and BiFPN-S, YOLOv7-E combined SP-ELAN and BiFPN-S, YOLOv7-F combined FSNet and SP-ELAN, and YOLOv7-FSB synergistically utilized all three methods.
As depicted in
Figure 5 and detailed in
Table 6, it is evident that, compared to the original YOLOv7-tiny model, the detection speed of each modular method has not been significantly affected, while detection accuracy has improved. This underscores the effectiveness of the lightweight techniques employed in this study for personnel overboard detection. Bold font indicates the best-performing results, and the same applies throughout.
The analysis of experiments 1–3 demonstrates that each method, when added, enhances both the model’s detection accuracy and speed compared to the baseline YOLOv7-tiny model. Among these methods, FSNet stands out with the highest speed of 98.1 frames per second and an mAP of 87.3%. The introduction of SP-ELAN effectively integrates self-calibration and original spatial context information, making more efficient use of the device’s computational capabilities, resulting in an mAP of 87.9% and a speed of 96.2 frames per second. The inclusion of BiFPN-S strengthens the model’s capability for extracting features of individuals in maritime distress and improves the inference speed, leading to an mAP of 87.5% and a speed of 92 frames per second.
Experiments 4–6 reveal that, in comparison to YOLOv7-tiny, the pairwise combinations of the three modules boost the mAP by 1.3%, 1.8%, and 1.5%, respectively. The integration of all three modules contributes to improved model performance while maintaining detection speed, resulting in a more lightweight model without sacrificing accuracy.
Based on the summarized optimization methods, we developed a lightweight search and rescue algorithm tailored for personnel overboard scenarios. This algorithm combines FSNet as the backbone network, integrates SP-ELAN for model enhancement, and incorporates BiFPN-S for feature fusion. The proposed method maintains the same outstanding detection speed as the original YOLOv7-tiny model while enhancing the mAP by 2.3%.
4.4.4. Comparison of Different Object Detection Models
To further analyze the performance of the lightweight algorithm in detecting individuals in maritime distress in aerial imagery, a comparison was conducted on the test dataset between YOLOv7-FSB and other networks, including YOLOv8n, YOLOv7-tiny, YOLOv5s, RetinaNet, SSD, and EfficientDet. Given that two-stage detection models have longer inference times, which may not meet real-time requirements, we opted to compare them with faster single-stage detectors. The recognition results of each network model are presented in
Table 7.
When comparing the lightweight model YOLOv7-FSB to other models, the results indicate significant differences. The SSD model experiences a 47.6% decrease in speed, a parameter increase of 17.8 million, and a 22.5% drop in the mAP. The RetinaNet model shows a 28.3% reduction in speed, a parameter increase of 30.5 million, and a 21.1% decrease in the mAP. The EfficientDet model’s speed decreases by 51.1%, with a parameter increase of 14.7 million and a 36.1% drop in the mAP. The YOLOv5s model’s speed drops by 12.9%, with a parameter increase of 1.21 million and a 4.1% reduction in the mAP. The YOLOv7-tiny model experiences a negligible 0.01% reduction in speed, an increase of 0.2 million parameters, and a 2.3% decrease in the mAP.
By integrating FSNet, our model achieves heightened non-linear fitting capabilities, improved receptive fields, and an optimized parameter count and training speed. The incorporation of the SP-ELAN module adeptly allocates computational resources, seamlessly integrating self-calibration and spatial contextual information. Even with streamlined parameters, the model sustains a high degree of discriminative power in its outputs. With support from the BiFPN-S module, the model reinforces pathways through efficient bidirectional cross-scale connections and weighted feature fusion, ensuring a delicate balance between enhanced performance and swift inference speeds. In the context of personnel overboard detection and tracking, our proposed YOLOv7-FSB method surpasses all other algorithms in terms of detection accuracy and speed, making it a practical choice for real-world applications. These results underscore the effectiveness of the lightweight techniques employed in this study, enabling real-time detection tasks based on UAV imagery of individuals in maritime distress.
4.4.5. Results and Visualization
To enhance transparency and facilitate a more intuitive evaluation and comparison of the proposed small-target detection methods, we incorporated the Grad-CAM (Gradient-weighted Class Activation Mapping) technique [43]. This approach visualizes the heat maps corresponding to detected objects, providing a visual representation of the network’s focus areas. Grad-CAM computes the gradients of the target class output with respect to the final convolutional layer’s feature maps. These gradients are then used to perform a weighted summation, producing activation maps that emphasize regions of interest. The visualization of these attention regions is crucial for understanding the network’s decision-making process, highlighting areas where the network is most confident about object detection or areas with high activation values.
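Formally, the class-discriminative localization map of Grad-CAM takes its standard form:

\[
\alpha_k^{c} = \frac{1}{Z}\sum_{i}\sum_{j} \frac{\partial y^{c}}{\partial A_{ij}^{k}}, \qquad
L_{\mathrm{Grad\text{-}CAM}}^{c} = \mathrm{ReLU}\!\left(\sum_{k} \alpha_k^{c} A^{k}\right)
\]

where \(A^{k}\) is the \(k\)-th feature map of the final convolutional layer, \(y^{c}\) is the score for class \(c\), and \(Z\) is the number of spatial locations in the feature map.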
Figure 6 illustrates Grad-CAM images for YOLOv7-tiny and YOLOv7-FSB across various scenarios, while
Figure 7 showcases corresponding Grad-CAM images for different enhancement methods in the same scene. In these images, brighter regions denote specific areas prioritized by the network. The enhanced models exhibit remarkable feature extraction capabilities, particularly in recognizing individuals in distress at sea and mitigating the impact of noise on the model.
As a result, the YOLOv7-FSB model demonstrates outstanding performance in the tasks of locating and rescuing individuals in maritime distress. We have employed this model as the detector in our tracking-by-detection visual framework.
4.5. Tracking Results Discussion
4.5.1. Comparison of Different Tracking Models
To validate the performance of the improved YOLOv7-FSB model when combined with ByteTrack and DeepSORT separately, we conducted experiments using three sets of image sequences. The results are presented in
Table 8.
In sequences 1 and 3, ByteTrack outperformed DeepSORT in terms of MOTA by 14.9% and 8.7%, respectively. Throughout the entire tracking process, the number of ID switches when using ByteTrack was significantly lower than when using DeepSORT. Trackers need to run in tandem with the detector, and the number of frames per second achieved using the ByteTrack algorithm was around 82. In sequence 2, ByteTrack’s MOTA was 5.3% higher than DeepSORT’s, with a similar number of ID switches. Typically, real-time monitoring requires processing at least 30 frames per second, ensuring the system can keep up with the flow of data. When combining the improved YOLOv7-FSB model with ByteTrack for validating video sequences, the average processing speed reached approximately 82.7 frames per second. Based on the data in the table, it is evident that ByteTrack, which relies on motion features alone, is more suitable for tracking maritime individuals in distress, offering significant advantages in terms of tracking efficiency and ID switching reduction.
4.5.2. Comparison of Different Detection Models
To assess the impact of the improved YOLOv7-FSB model on tracking results, we conducted experiments by combining YOLOv7-tiny and YOLOv7-FSB separately with ByteTrack. The results are displayed in
Table 9.
Across all the image sequences, when using YOLOv7-FSB as the detector for tracking, the number of ID switches was higher compared to when using YOLOv7-tiny as the detector. However, the MOTA value when combining YOLOv7-FSB with ByteTrack was higher than when using YOLOv7-tiny with ByteTrack. Typically, an increase in ID switches leads to a decrease in MOTA because it indicates a less stable tracking process. The situation where both the number of ID switches and the MOTA values increase is due to YOLOv7-FSB having fewer instances of missed detections, meaning it provides better object detection capabilities compared to YOLOv7-tiny. This further validates the effectiveness of the YOLOv7-FSB model proposed in this paper.
In conclusion, the lightweight solution presented in this paper, YOLOv7-FSB, reduces the number of parameters and computations, allowing it to overcome the constraints of UAV computing resources. When detecting individuals in maritime distress in aerial images, this algorithm achieves detection speeds comparable to YOLOv7-tiny while significantly improving detection accuracy. Combining YOLOv7-FSB with ByteTrack results in excellent tracking performance, meeting the practical engineering requirements for UAV-based search and rescue operations. The model proposed in this article can locate people who have fallen into the water more quickly, improving their chances of rescue, which is of great significance.
4.6. Future Research Directions
In the future, our research will continue to focus on improving detector accuracy by exploring multi-sensor fusion techniques. We aim to integrate data from multiple sources, such as visible light, thermal imaging, and near-infrared sensors, to enhance the system’s detection robustness. Additionally, we plan to refine the architecture and expand functionalities to create a more powerful and comprehensive solution.
5. Conclusions
This study presents YOLOv7-FSB, a novel algorithm designed for real-time detection and tracking of individuals who have fallen overboard at sea. The algorithm is built upon the YOLOv7-tiny framework with a focus on lightweight design to reduce detector size, maintain recognition speed, and enhance detection accuracy, thereby ensuring real-time recognition of individuals in maritime distress in aerial images. YOLOv7-FSB employs FSNet as the backbone network, reducing redundant computations and memory access. The SP-ELAN module makes better use of the device’s computational capabilities. Additionally, the enhanced feature pyramid structure, BiFPN-S, bolsters its feature extraction capability and inference speed. To validate the effectiveness of YOLOv7-FSB, rigorous testing was conducted using datasets selected from MOBDrone and SeaDronesSee as benchmarks. The testing included ablation experiments and comparative trials. Subsequently, we combined the lightweight YOLOv7-FSB model with ByteTrack as a detection-based tracker, ensuring that the tracking performance meets the real-time and accuracy requirements for detecting and tracking individuals in maritime distress in aerial images. The visual model proposed in this paper can accurately perform real-time detection and tracking tasks, offering a suitable technological solution for large-scale and rapid search and rescue operations for individuals in maritime distress.