1. Introduction
The Great Caribbean manatee,
Trichechus manatus manatus, known as the Antillean Manatee [
1], is an aquatic herbivorous mammal that inhabits tropical and subtropical wetlands and rivers, ranging from Mexico to Brazil and the Western Caribbean [
2]. It was classified as an endangered species by the International Union for Conservation of Nature (IUCN) owing to a decline in the regional population more than a decade ago [
3]. This decrease has been attributed to various threats, including habitat degradation, illegal hunting, boat collisions, and low genetic variability [
4,
5,
6]. Understanding habitat use and accurately estimating population size are fundamental to developing effective protection and conservation policies.
In Panama, manatees have been under legal protection since 1967 [
7]. The species is traditionally linked to the Caribbean coasts of Panama, specifically the turbid brackish waters covered by thick aquatic vegetation in the wetlands of the Changuinola and San San Pond Sak rivers, in the province of Bocas del Toro (western Panama).
Side-scan sonar has been shown to be effective in the counting and mapping of manatees, revealing seasonal variations in abundance [
5,
8]. More importantly, bioacoustic methods have determined that between 18 and 33 individuals are found in these wetlands [
9,
10].
However, the discovery of a significant population in the Panama Canal Basin is groundbreaking [
11]. Recent studies now reveal that the Canal Basin harbors approximately 20 to 25 individuals, a number that is comparable to those that inhabit the rivers in Bocas del Toro. This novel finding opens a new chapter in understanding the distribution of manatees and underscores the importance of innovative survey techniques to estimate changes in their population number, as well as to understand how local manatees use their habitat and where they can be observed seasonally and annually around the Basin.
As stated above, research efforts have been proposed to improve estimates and conserve manatees using non-invasive methods regionally and locally, such as sonar [
5], aerial surveys using high-wing airplanes [
12,
13], and extensive acoustical monitoring [
9]. Consequently, we have devised digital signal processing and machine learning approaches for passive monitoring using bioacoustical recordings [
14]. Specifically, we have focused on understanding the acoustic properties of manatee vocalizations [
15]. This understanding has been translated into methods to facilitate the detection, identification, and counting of unique individuals using various machine learning techniques [
9,
10,
16,
17]. These methods have been used to implement an edge computing device for real-time acoustic detection [
18] and a data analysis platform [
19] to estimate the demographics of manatees in Panama. However, crewed aircraft surveys face significant logistic and cost-efficiency challenges due to the nature of tropical wetlands, particularly when taking aerial images of turbid brackish waters covered in dense aquatic vegetation [
5,
9].
The use of UAVs for marine wildlife research has gained popularity in recent years, encompassing both RGB and satellite spectral imagery [
20,
21]. As technological developments rapidly advance the versatility and functionality of affordable devices, their potential as a marine aerial survey tool has garnered attention for monitoring aquatic creatures such as humpback whales [
22], river dolphins [
23], and various manatee species [
24,
25], among others [
26]. For this purpose, machine learning models based on deep convolutional neural networks (DCNNs) have enabled automatic detection in UAV imagery, mostly using only the RGB channels.
In this regard, the work of Dujon et al. [
27], which examined how the morphology, spacing, behavior, and habitat of Australian seals, loggerhead sea turtles, and gannets affect CNN-based object detection, is a notable example. Similarly, the performance of convolutional object detection architectures has been tested for the aerial survey of whales in Chinese waters [
28] and sharks [
29].
Recent studies have used YOLOv8 for automatic object detection and segmentation, focusing on the preservation of wildlife on various scales, such as monitoring multiple endangered species [
30] or estimating the population of specific subspecies [
31]. However, for aerial images taken by drones, the detected objects are mostly small targets; therefore, the target scale changes considerably due to the influence of the aerial perspective [
32]. In this context, some authors have attempted to solve the detection scale problem by modifying the YOLOv8 architecture. Liu et al. [
33] replaced several components of the original architecture to improve model performance and generalizability. Similarly, Wang et al. [
32] added an attention mechanism to help the model focus on important information. Their multiscale feature fusion network significantly improves detection performance, especially for small objects as demonstrated by their experiments on the
VisDrone2019 dataset. Another approach [
34] focused on improving the model performance in varying lighting conditions [
35].
Moreover, Slicing Aided Hyper Inference (SAHI) is an emerging technique that enhances small-object detection by automating tile overlap during model inference. Tile overlap enables the detection of objects at tile edges that might otherwise be split across adjacent tiles. A recent study demonstrated that SAHI increased average precision by approximately 5.7% for three different predictors after model tuning [
36]. This tool improved detection in aerial UAV image analysis [
37,
38] and is one of the core techniques adopted in this study, in addition to fine-tuning YOLOv8 with its built-in augmentation capabilities.
In recent years, significant progress has occurred in the field of zero-shot learning (ZSL). This machine learning paradigm enables models to make inferences about classes for which no task-specific training has been performed. Unlike traditional methods, ZSL leverages shared knowledge or attributes between categories instead of relying on direct experience with each specific class [
39]. This began with the development of pre-trained large language models such as GPT-3 [
40], and later continued with the rise of models capable of combining visual and textual features such as CLIP (Contrastive Language–Image Pre-Training) [
41]. In this work, we employ the zero-shot capabilities of AltCLIP [
42], a vision-and-language model based on CLIP that was trained on millions of images and a web-crawled text corpus. Classification is performed by measuring the similarity between image embeddings and custom text prompts that define the five image classes in our study.
Although drones, or Unmanned Aerial Vehicles (UAVs), have not been extensively explored for manatee monitoring in Panama, our research addresses a single question: whether UAV images, together with a trained YOLOv8 detection model and a ZSL classification model, constitute a viable and effective method for detecting manatees in the Panama Canal Basin.
The main contribution of this study is the integration of advanced machine learning techniques, such as deep convolutional neural networks (YOLOv8), Slicing Aided Hyper Inference (SAHI), Zero-Shot Learning (ZSL), and Contrastive Language–Image Pre-Training (CLIP), to tackle the complex challenge of detecting and counting manatees in drone footage. This problem is particularly non-trivial due to the small size of the targets, their partial submersion, and the turbid waters of their habitat, such as those found in Panama’s wetlands. By combining these methods, our approach improves the detection accuracy and robustness, addressing key limitations in traditional aerial survey techniques.
2. Materials and Methods
2.1. Study Location
Gatun Lake, which covers approximately 436 km² (Figure 1), was created in 1913 as a water reservoir for the Panama Canal, which spans 80 km from the Atlantic to the Pacific [
43]. In 1913, an invasion of aquatic plants, including water hyacinth, was reported, and efforts were made to find natural ways to control its spread in the lake [
44]. In June 1964, nine Great Caribbean manatees (
Trichechus manatus) and one Amazonian manatee (
T. inunguis), neither of which is native to the area, were introduced into Gatun Lake to control aquatic plants and to help address the proliferation of malaria-transmitting mosquitoes for public health purposes. However, the project did not achieve its initial objectives [
11,
45].
According to the Public Health Chapter of the 1966 Panama Canal Company and Canal Zone Government Reports, “...aquatic vegetation in the Chagres River and certain areas of Gatun Lake will require more than a large herd of manatees for effective control”. It was concluded that the size and scope of the Canal Zone waters “...will not be effectively controlled with the use of manatees”. As a result, the animals were abandoned after escaping from the enclosure, and the authorities terminated the experiment.
They are now found in the waters of the Panama Canal [
11,
46,
47]. Reports also suggest that the manatee population has grown extensively [
48]. In 2020, a manatee sighting was recorded in the Panamanian Pacific [
11], confirming reports from several decades ago [
47].
This expanding population can harm the species itself and simultaneously pose a hazard to the normal functioning of the Panama Canal. Moreover, it can disturb the natural balance and threaten the native local fauna. Thus, it is of the utmost importance that non-invasive methods are used to assess the population that inhabits the Panama Canal Basin.
2.2. Data Collection
Gatun Lake, approximately 436 km², was divided into 44 polygons of varying size (ranging from 48 to 856 ha) to cover the entire lake. This division accounted for overflight restrictions imposed by the Panama Canal Authority on the main shipping channel and for the varying morphology of the lake shoreline and its bordering forests. The polygons were also designed to reduce flight time. The locations of these 44 polygons are shown in
Figure 1.
Each polygon was covered by preprogrammed flight lines, usually with a 25% forward and 40% lateral image overlap and speeds between 5 and 12 m/s. The gimbal angle was kept fixed throughout the flights. The flight altitude was maintained between 60 and 100 m, according to the specific flight area. Flights were conducted every 15 days to cover the entire area, depending on weather conditions (wind, rain), between 8:00 and 11:00 h, from February 2023 to April 2024. Two drones were used: a DJI Mavic 3 Enterprise (DJI, Shenzhen, China) equipped with a 20 MP camera and a DJI Matrice 200 (DJI, Shenzhen, China) with a 24 MP camera, each flown by its own licensed pilot. Photos were taken every 3–5 s along the flight path, with consecutive image overlaps of 20% (front) and 80% (side).
2.3. Data Organization
Two separate pilots obtained three image collections following a previously defined data collection protocol. The first two collections comprised 64,240 and 15,458 images for Pilots 1 and 2, respectively, and were used for training and testing purposes (with images chosen randomly in each set). A third collection of 22,336 images from the most recent Pilot 1 flights was set aside as unseen data to test the validity of the model. Original image resolutions were 3840 × 2160, 4000 × 3000, and 5280 × 3956 pixels for Pilot 1 and 9504 × 6336, 6016 × 4008, and 6016 × 3376 pixels for Pilot 2. After manually inspecting both image collections, we observed differences in quantity and quality under diverse conditions such as altitude, landscape, and visibility of the target animals. Taking these factors into consideration, the Pilot 1 collection was selected as the training dataset due to its higher number of images and less complex environments with more visible animals. The Pilot 2 collection was selected for testing, as it contained images of higher complexity (less visible animals in worse lighting conditions); this is ideal for testing the model's performance and detecting possible signs of overfitting.
A total of 65 flights were conducted, covering an approximate area of 26,678 hectares in 220.9 flight hours. Initially, the polygons were sampled biweekly until we found the areas most frequented by manatees, where we focused our subsequent periodic efforts.
From these efforts, 76 manatees were observed in the 57,332 images acquired, which were initially analyzed manually. Sixteen mother–calf pairs (1–3 pairs per day) were observed on nine sampling days, yielding 179 images across both pilots.
For object detection tasks, images of the same animals were arranged as a series of sequences (see
Figure 2) to avoid mixing similar images across the training, validation, and testing datasets. Pilot 1 images, consisting of 30 sequences, were used for training, while Pilot 2 images, which comprised fewer sequences, were used for testing.
On the other hand, for classification tasks, which did not require a training phase, a subset of 316 image crops was used to evaluate the performance of ZSL. This dataset comprised images of manatees from all sequences, manually selected to avoid similar instances of the same animals. It also included photos of logs and background images (see
Figure 2).
2.4. Detection Model
The YOLOv8 architecture comprises three primary components: the backbone, neck, and head. The backbone typically employs a deep convolutional neural network (DCNN) to extract feature maps from the input images. These feature maps are then passed to the neck, which utilizes pyramidal methods to enhance and refine features, capturing multiscale information crucial for detecting objects of various sizes. Finally, the processed features are fed into the head module, which predicts the bounding box coordinates of each detected object and the class probabilities.
YOLOv8 was chosen for its balance between performance, efficiency, and real-world applicability [
49]. More importantly, YOLOv8 has been shown to work well at detecting small targets in vast backgrounds, especially when coupled with Slicing Aided Hyper Inference (SAHI) [
50,
51,
52]. Thus, it is a perfect fit for the detection and classification of small, partially submerged manatees in complex aquatic environments such as the Panama Canal basin.
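As an illustration of how SAHI can wrap a detector for sliced inference, the following is a minimal sketch assuming the publicly available sahi and ultralytics packages; the weights path, image file, confidence threshold, and slice sizes are placeholders rather than the exact values used in this study.

```python
# Minimal sketch: sliced inference with SAHI wrapping a fine-tuned YOLOv8 model.
# Paths, thresholds, and slice sizes are illustrative assumptions.
from sahi import AutoDetectionModel
from sahi.predict import get_sliced_prediction

detection_model = AutoDetectionModel.from_pretrained(
    model_type="yolov8",
    model_path="runs/detect/train/weights/best.pt",  # hypothetical weights path
    confidence_threshold=0.25,
    device="cuda:0",
)

result = get_sliced_prediction(
    "flight_frame.jpg",          # hypothetical full-resolution UAV frame
    detection_model,
    slice_height=640,
    slice_width=640,
    overlap_height_ratio=0.2,    # 20% tile overlap keeps objects lying on tile edges
    overlap_width_ratio=0.2,
)

for pred in result.object_prediction_list:
    # print class name, confidence, and bounding box corners
    print(pred.category.name, pred.score.value,
          pred.bbox.minx, pred.bbox.miny, pred.bbox.maxx, pred.bbox.maxy)
```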
Although newer YOLO versions may offer improvements [
53,
54], one of the future objectives of this study is to potentially implement a customized embedded system, and for this reason, YOLOv8 remains a stable, well-supported, and computationally feasible choice.
This architecture allows YOLO models to achieve a balance between accuracy and speed, making them highly effective for real-time object detection tasks. The network is built from the following essential blocks:
CBS layers: These are compound layers of convolutional filters, batch normalization, and the SiLU activation function. They perform a convolution operation that extracts image features by applying filters to the input data.
C2f layers: They are designed to enhance feature extraction and fusion. They do this by splitting the input features into two parts: one is processed through several convolutional layers, while the other remains unchanged. Then, these parts are merged to allow the network to retain original information while adding refined features.
SPPF layer: This is a fast variant of the Spatial Pyramid Pooling (SPP) layer. It pools features at different sizes to capture details from various levels and then combines these pooled features into a comprehensive representation. This enhances the model’s ability to recognize objects of different sizes while maintaining computational efficiency.
Concatenation layer: This is used to combine several feature maps along a specified dimension (usually the channel dimension).
Upsampling layer: This is used to increase the spatial resolution of feature maps.
A visual description of the network is provided in
Figure 3.
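To make the CBS building block concrete, the following is a minimal PyTorch sketch of a Conv–BatchNorm–SiLU unit; it mirrors the idea described above rather than reproducing the exact Ultralytics implementation.

```python
# Illustrative PyTorch sketch of a CBS block (Conv + BatchNorm + SiLU).
import torch
import torch.nn as nn

class CBS(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel: int = 3, stride: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel, stride,
                              padding=kernel // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)   # per-channel normalization of activations
        self.act = nn.SiLU()               # SiLU (swish) activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.conv(x)))

# Example: a 640x640 RGB crop passed through one CBS block with stride 2
x = torch.randn(1, 3, 640, 640)
y = CBS(3, 32, kernel=3, stride=2)(x)
print(y.shape)  # torch.Size([1, 32, 320, 320])
```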
2.5. Training and Validation Parameters
For all experiments, fine-tuning was performed on YOLOv8s, the small variant of the model, using default learning-rate and optimizer parameters. Training and validation samples were generated as 640 × 640 crops (with 20% horizontal and vertical overlap) using SAHI. All samples were labeled using the YOLO bounding box format with two target classes: class “manatee” and class “logs”.
Figure 4 provides a clear example of the two classes co-existing in a single image.
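As a concrete illustration of this slicing step, the sketch below uses the sahi package's slice_image utility to cut a full-resolution frame into 640 × 640 crops with 20% overlap; the file name and output directory are hypothetical.

```python
# Sketch: slicing a full-resolution UAV frame into 640x640 training crops
# with 20% overlap using SAHI. File names and paths are illustrative.
from sahi.slicing import slice_image

slice_image(
    image="flight_frame.jpg",          # hypothetical full-resolution frame
    output_file_name="flight_frame",
    output_dir="dataset/train/images",
    slice_height=640,
    slice_width=640,
    overlap_height_ratio=0.2,          # 20% vertical overlap between adjacent crops
    overlap_width_ratio=0.2,           # 20% horizontal overlap
)
# The crops (and their pixel offsets) are written to output_dir;
# COCO-style annotations can be sliced analogously with sahi.slicing.slice_coco.
```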
Figure 3. Simplified graphical representation of the YOLOv8 architecture.
Figure 5 provides examples of the “logs” class, which also contains the undesired background objects and landscapes that can be found in the dataset.
Model parameters include online data augmentation. These techniques, native to YOLOv8, apply image and color-space transformations to random images that are spliced into a composite (mosaic) image and fed directly into the neural network. This technique is inherited from an earlier version, YOLOv4 [
55], allowing the identification of objects at a smaller scale than usual while significantly reducing the need for a large mini-batch size.
Figure 6 provides examples of the data augmentation transformations applied during training.
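For reference, the following is a minimal sketch of how such a fine-tuning run with Ultralytics' built-in online augmentation could be launched; the dataset configuration file and the specific hyperparameter values shown are illustrative assumptions, not the exact settings used here.

```python
# Sketch of fine-tuning YOLOv8s with Ultralytics' online augmentation
# (mosaic, HSV jitter, flips). Dataset YAML and hyperparameter values are assumptions.
from ultralytics import YOLO

model = YOLO("yolov8s.pt")              # pretrained small variant
model.train(
    data="manatee_dataset.yaml",        # hypothetical dataset config (train/val paths, 2 classes)
    imgsz=640,
    epochs=100,
    mosaic=1.0,                         # probability of composing 4 random images into a mosaic
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,  # color-space jitter
    fliplr=0.5,                         # horizontal flip probability
)
```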
2.6. Zero-Shot Classification Model
Once manatees were detected by YOLOv8, a method was designed to count individuals and post-classify images using the zero-shot learning (ZSL) approach. This method addresses an important issue that arises when counting individuals, adult–calf pairs, or larger groups in the current dataset: the detection model can often make mistakes, as it is trained on limited samples and highly imbalanced classes.
ZSL is implemented as a similarity learning algorithm. That is, a large pre-trained model (such as AltCLIP) enables image and text features to coexist in the same dimensional space (i.e., embeddings) and then calculates similarity using the cosine distance [
42].
AltCLIP represents a novel integration in this context, as it addresses the challenges of detecting small, partially submerged manatee groups in complex environments such as the Panama Canal Basin.
Figure 7 provides a visual description of the AltCLIP model architecture used for image processing.
The model creates an image embedding using the CLIP encoder. A similar process is performed with a multi-language encoder to produce another embedding, which is then projected to co-exist in the same embedding space. Contrastive learning is the core of making text and image embeddings coexist, as the model was previously trained using pairs of matching (positive) and non-matching (negative) image–text examples. Therefore, it uses the acquired knowledge from general classes to adapt to the downstream task (classification of different manatee classes) [
42]. Finally, after calculating cosine similarity, the resulting scores are converted into a probability distribution to find the most probable text class for each input image.
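The following is a minimal sketch of this zero-shot pipeline using the Hugging Face implementation of AltCLIP; the prompts are shortened stand-ins for the descriptions in Table 1, and the input file name is hypothetical.

```python
# Minimal sketch of zero-shot classification with AltCLIP via Hugging Face Transformers.
import torch
from PIL import Image
from transformers import AltCLIPModel, AltCLIPProcessor

model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")

prompts = [
    "a single manatee with a fish-like shape seen from above",  # illustrative, not the exact Table 1 text
    "a mother manatee with a smaller calf",
    "a group of several manatees",
    "a floating log in the water",
    "water surface background",
]
image = Image.open("detection_crop.jpg")   # hypothetical crop from the detector

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# scaled cosine similarities (logits) converted into a probability distribution
probs = outputs.logits_per_image.softmax(dim=1).squeeze()
print(prompts[int(probs.argmax())], float(probs.max()))
```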
In this way, the model can classify without needing labeled training data. The model requires the following components to perform the classification process:
Image pre-processing: To normalize the appearance of manatees in diverse environments and lighting conditions, we employed YCbCr decomposition. This technique separates the luminance (Y) component of an image from the chrominance components (Cb and Cr) (see
Figure 8). It has been demonstrated that isolating the Y channel is crucial for separating luminance and reflectance, which helps to reduce distortions and normalize the effects of water [
56]. We employed the same technique, using the Y channel as the primary input for the model, preceded by normalization and resizing to 224 × 224 pixels (the input size for AltCLIP); a sketch of this step is given after the list below.
Custom class prompts:
Table 1 shows the descriptive text used to define the classes. Since large models are trained with thousands of general classes, we relied on the literal word “manatee” together with descriptive text focused on morphological features, such as the distinctive “fish” shape of manatees when observed from a high perspective. This proved to be the most effective way to guide the model toward the features of the classes. In contrast, more literal wording was used to describe the “background” and “logs” classes.
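The sketch below illustrates the luminance-based pre-processing step referred to above, assuming OpenCV's YCrCb conversion and a simple normalization; the exact normalization used in the study may differ, and replicating the single Y channel to three channels is only an assumption about how it is fed to the encoder.

```python
# Sketch of the luminance-based pre-processing: convert to YCrCb, keep the Y
# (luminance) channel, and resize to AltCLIP's 224x224 input. File name is hypothetical.
import cv2
import numpy as np

bgr = cv2.imread("detection_crop.jpg")              # crop produced by the detector
ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)      # separate luminance from chrominance
y_channel = ycrcb[:, :, 0]                          # Y = luminance
y_resized = cv2.resize(y_channel, (224, 224))
y_norm = (y_resized - y_resized.mean()) / (y_resized.std() + 1e-6)  # simple normalization
model_input = np.repeat(y_norm[:, :, None], 3, axis=2)  # replicate to 3 channels (assumption)
```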
2.7. Model Evaluation Metrics
As with any classification task, the purpose is to build a model that learns the features that define each class. In our case, the images were labeled as belonging to two classes, manatee and logs, and the rest were treated as background. Logs were considered an additional class since they are widely present in many locations of the Panama Canal during drought periods, when lower water levels make them visible in the landscape [
57,
58]. More importantly for the model, in some cases the silhouettes of logs resembled manatees and were a source of classification errors during the first training runs.
Classic accuracy metrics (precision and recall) were used to evaluate model performance on these two classes, together with mean average precision (mAP) and intersection over union (IoU), which are briefly explained as follows:
Intersection over Union (IoU): This is a number that indicates the overlap between the predicted bounding box coordinates and those of the ground truth (where the class actually is in the image). A higher IoU indicates that the coordinates align more closely, i.e., a better prediction. This is important, as the IoU threshold value helps us decide whether a prediction is a True Positive, False Positive, or False Negative. It is calculated with Equation (1):

$$\mathrm{IoU} = \frac{\text{Area of Intersection}}{\text{Area of Union}} \quad (1)$$

In this equation, the area of intersection is the region where the predicted and ground truth boxes overlap, while the area of union is the total area covered by both boxes combined, counting the overlapping region only once.
Mean average precision (mAP): This compares the predicted class associated with each region of interest against the ground truth in the original image labels and returns a score. The higher the score, the more accurately the model detects the object in question. It is calculated with Equation (2):

$$\mathrm{AP} = \sum_{n}\left(R_n - R_{n-1}\right)P_n, \qquad \mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{AP}_i \quad (2)$$

where $P_n$ and $R_n$ are the precision and recall at the $n$-th threshold and $N$ is the number of classes. mAP is computed over detections above a minimum confidence threshold (the minimum probability for an output detection to be considered) and is reported at given IoU thresholds: detection models are usually evaluated at a 50% IoU threshold (mAP50) and averaged over IoU thresholds between 50% and 95% (mAP50:95).
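As a small worked example of the summation in Equation (2), the following computes an average precision value from illustrative (made-up) precision–recall points:

```python
# Illustrative AP computation following Equation (2); the arrays are made-up values.
import numpy as np

recall    = np.array([0.0, 0.2, 0.4, 0.6, 0.8])
precision = np.array([1.0, 0.9, 0.85, 0.7, 0.6])

ap = np.sum((recall[1:] - recall[:-1]) * precision[1:])
print(round(float(ap), 3))  # 0.61
```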
Fitness: This metric is used in YOLO models to compare the performance of different models as a weighted combination of mean average precision. It assigns a 10% weight to mAP50 and 90% to mAP50:95 [
59]. It is calculated using Equation (3):

$$\text{fitness} = 0.1 \cdot \mathrm{mAP}_{50} + 0.9 \cdot \mathrm{mAP}_{50:95} \quad (3)$$

This equation is used to evaluate the training of all YOLOv8 models under a cross-validation strategy.
Top-k accuracy: This is employed to evaluate classification models by measuring the proportion of instances where the true label is among the top k predicted labels ranked by the model’s confidence. It generalizes traditional accuracy, in which only the top prediction counts, by considering multiple plausible predictions for each image. This metric helps to understand the possible overlaps between classes in zero-shot classification.
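As an illustration, top-k accuracy can be computed with scikit-learn's top_k_accuracy_score; the labels and probability rows below are made-up values, not study results.

```python
# Sketch of top-k accuracy for a zero-shot classifier; values are illustrative only.
import numpy as np
from sklearn.metrics import top_k_accuracy_score

y_true = np.array([0, 2, 1, 0])          # true class indices for four crops
y_prob = np.array([                      # per-class probabilities from the classifier
    [0.60, 0.30, 0.10],
    [0.20, 0.30, 0.50],
    [0.40, 0.35, 0.25],
    [0.50, 0.10, 0.40],
])
print(top_k_accuracy_score(y_true, y_prob, k=2, labels=[0, 1, 2]))  # 1.0 for this toy example
```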
2.8. Experimental Setup
The experiments were performed with the following setup:
Detection using YOLOv8 (Experiment 1): Fine-tuning was performed using the training subset described in
Section 2.3; a 10-fold random cross-validation experiment was conducted. Afterwards, using the model with the highest fitness value (
Section 2.7), the performance on the test subset was evaluated to determine the degree of learning of the model and its effectiveness on unseen data.
Detection of small objects in full-resolution images using SAHI (Experiment 2): Using the top-performing model from Experiment 1, we conducted inference on full-resolution sequences to evaluate the model’s capability in a real-world scenario. The goal was to determine if manatees, as small objects in large-scale imagery, could be reliably detected when dealing with the complexities of high-altitude aerial frames.
Zero-shot classification of manatees (Experiment 3): Following the detection performance evaluation, we tested the ability of AltCLIP to identify individual manatees without training data. This was performed using the previously used training and testing subsets from Experiment 1, using the image processing techniques and text prompts outlined in
Section 2.6. The goal was to assess the effectiveness of zero-shot learning on counting individual animals and identifying key demographic groupings such as calf–mother pairs.
2.9. Computational Hardware and Software
All experiments were performed on a custom PC with an AMD Ryzen 9 5950X CPU, an EVGA RTX 3080 GPU with 10 GB of VRAM, and 128 GB of RAM, running Ubuntu 22.04. Python (3.9.17) and libraries for image and data pre-processing were used, among them Numpy (1.26.4), Scikit-learn (1.0.2), Librosa (0.10.1), OpenCV (4.8.0.76), and Scipy (1.11.2).
As stated earlier, a specialized version of YOLO was used; the YOLOv8s model can be obtained from Ultralytics [
60]. Moreover, to analyze the convolutional layers of the neural network in detail, and to have a clear understanding of the internals of the neural networks, representation software (GradCAM [
61]) was adapted to work with Pytorch and YOLOv8. Finally, the AltCLIP model and its pipeline and pre-processing functions were used from the
Huggingface Transformers library [
62].
4. Discussion
Our research addresses the question of whether drone images from the Panama Canal Basin can be used to train YOLOv8 detection and ZSL classification models. The results regarding YOLOv8’s capabilities to detect manatees indicate that the model can identify most instances of manatees in the monitored areas of the Panama Canal. As shown in Experiment 1, the objective was to assess object detection performance under ideal conditions using small, low-resolution images in which most manatees were easily visualized. The loss-per-epoch plots shown in
Figure 9 and
Figure 10 indicate that the model struggled during the initial epochs due to fluctuations in error across specific batches of images; this could be attributed to the high variability among some manatee sequences. However, the model produced stable metrics at the end of the training cycle, reaching high precision (93%) on known data across all classes (
Table 2).
Looking at the performance of the model on unseen data,
Table 2 shows a very low recall (43.8%) in the test set, indicating the presence of false negatives, i.e., undetected manatees (
Figure 15A). Moreover, the lower precision value (86.2%) indicates a higher number of false positives (
Figure 15B). This could be due to environmental factors, such as the abundance of undesirable objects (see
Figure 16). These objects can introduce inference errors, such as underwater soil that creates patterns that are difficult to assess by the model (shown in
Figure 16A), or water surface reflections and underwater logs, which could be mistaken for a floating manatee (seen in
Figure 16B,C).
Experiment 2 reveals insights into the actual accuracy of YOLOv8 in a more complex setup, specifically, high-altitude shots.
Table 4 shows that manatees were detected in 28 of 30 sequences. It is interesting to note that most of the undetected sequences involved higher-altitude shots in complex scenery. Examples of this complexity are shown in
Figure 17A, which depicts a calf-mother pair that can barely be identified due to the high altitude (199 m). In contrast, a clear case of an image where manatees are present in a complex environment is shown in
Figure 17B, illustrating two manatees swimming in a feeding spot, but their characteristic fish shape is barely noticeable. It is also worth mentioning that this image is unique among all datasets.
Manatees were detected in 11 of the 15 sequences in the test set.
Figure 17C is an example of easy detection, that is, images that are less complicated for the model to interpret, as they differ the most from the training set samples. However, low performance is expected in the detection of manatees in some images, because the model struggles to fully learn the features of unseen data (primarily because the test set comprises more complex shots); one of these more complex examples is sequence 10 (shown in
Figure 17D). Given these results, it is safe to conclude that the model can detect most manatees in favorable lighting conditions. However, lower performance is expected at high altitudes or in very complicated shots with unfavorable lighting or environmental conditions, a common issue with drone footage.
The ZSL classification scheme uses text prompts with AltCLIP. The metrics shown in
Table 6 indicate that the model can identify individual manatees, logs, and background images, all with high F1-scores. However, it struggles to correctly count mother–calf pairs and groupings of more than two manatees. The following factors could explain the lower performance when trying to identify multiple manatees:
In some scenarios, the precise shapes and morphological characteristics of groups of multiple manatees are not visible, as shown in
Figure 18. The characteristic fish-like shape can be evident (
Figure 18C) or less clear when viewed from higher altitudes (
Figure 18A), leading to classification errors in which three classes (logs, mother–calf pairs, and groups of manatees) have very similar detection probabilities (Table 7).
In some cases, when a manatee is not visible due to lighting or environmental conditions (as seen in
Figure 18B), adjacent logs could be mistaken for manatees, owing to the critical condition established in the text prompt for the mother–calf class, namely that one animal should be smaller than the other (see
Table 1).
There is a clear overlap between the classes, especially when counting multiple animals. This is reflected in the output probabilities (
Figure 18C and
Table 7), where both mother–calf pairs and many manatees are plausible classes.
5. Conclusions
This work assessed the abilities of YOLOv8 in the detection of manatees under various environmental conditions and geographical locations. The results showed acceptable output metrics, with mean accuracies between 70% and 90%. This indicates that, despite the high variability in the scenery and appearance of manatees, it is possible to train a model for manatee detection. Further exploration of data-augmentation techniques could enhance the model by evaluating different combinations of image transformation variables, such as perspective, rotation, and color.
Exploring more complex model architectures, such as YOLOv9 [
63] or YOLO-NAS [
64], along with various hyperparameter configurations, may yield better results. However, we advocate the use of YOLOv8 due to its direct compatibility with SAHI and its ability to improve small-object detection performance. The size of the model must also be taken into account, as it is a crucial factor in the implementation of a drone-portable real-time system [
18]. Despite class overlap and the complexities of counting multiple manatees of different sizes, AltCLIP achieves a high top-2 accuracy close to 90% (
Table 6). It is also worth noting the accessibility of zero-shot learning: it does not require training procedures and allows users to focus on feature pre-processing rather than on computational constraints. This is especially convenient in situations of low data availability, where adding more high-quality data is very difficult and time consuming. Nevertheless, AltCLIP can struggle to distinguish multiple manatees or mother–calf pairs due to similar morphological features and overlapping “visual cues”. According to Metzen et al. [
65], a method known as AutoCLIP could be used to automatically adjust the importance of each text prompt based on how well it matches the image. This could help AltCLIP better differentiate between visually similar classes by emphasizing key distinguishing features (e.g., “two manatees swimming together” vs. “a single manatee”) at inference time.
Furthermore, experimenting with different text features could produce better results. Using different languages introduces another level of complexity when describing the scenery and the manatee morphological characteristics, as contextual descriptions vary in languages such as Spanish. Finally, more advanced models, including large language models (LLMs), vision–language models (VLMs), and multimodal models, could be explored [
66,
67,
68,
69].
Looking ahead, future research should prioritize refining detection models to improve accuracy, especially in identifying adult–calf pairs, which are vital for understanding reproductive dynamics. In addition, recognizing manatees traveling in groups is critical for precise population assessments and for projecting future trends and the overall population in this basin. Moreover, refining the process of determining whether the same individual appears in multiple images, and avoiding duplicate counts through temporal tracking or spatial–temporal analysis techniques, are essential for accurate estimations [
49,
70].
This study is of particular importance, as it provides the first comprehensive insight into the manatee population in the Panama Canal Basin using a well-established computer vision methodology for detecting and classifying manatees. This methodology could be used to track individual movements, assess habitat use, and gauge the impacts of wetland deforestation and land-use change occurring in the basin. Finally, the work sets the stage for future ecological research and conservation strategies for this IUCN-listed endangered species in Panama.