1. Introduction
The Great Caribbean manatee,
Trichechus manatus manatus, known as the Antillean Manatee [
1], is an aquatic herbivorous mammal that inhabits tropical and subtropical wetlands and rivers, ranging from Mexico to Brazil and the Western Caribbean [
2]. It was classified as an endangered species by the International Union for Conservation of Nature (IUCN) owing to a decline in the regional population more than a decade ago [
3]. This decrease has been attributed to various threats, including habitat degradation, illegal hunting, boat collisions, and low genetic variability [
4,
5,
6]. Understanding habitat use and accurately estimating population size are fundamental to developing effective protection and conservation policies.
In Panama, manatees have been under legal protection since 1967 [
7]. The species is traditionally linked to the Caribbean coasts of Panama, specifically the turbid brackish waters covered by thick aquatic vegetation in the wetlands of the Changuinola and San San Pond Sak rivers, in the province of Bocas del Toro (western Panama).
Side-scan sonar has been shown to be effective in the counting and mapping of manatees, revealing seasonal variations in abundance [
5,
8]. More importantly, bioacoustic methods have determined that between 18 and 33 individuals are found in these wetlands [
9,
10].
However, the discovery of a significant population in the Panama Canal Basin is groundbreaking [
11]. Recent studies now reveal that the Canal Basin harbors approximately 20 to 25 individuals, a number that is comparable to those that inhabit the rivers in Bocas del Toro. This novel finding opens a new chapter in understanding the distribution of manatees and underscores the importance of innovative survey techniques to estimate changes in their population number, as well as to understand how local manatees use their habitat and where they can be observed seasonally and annually around the Basin.
As stated above, research efforts have been proposed to improve estimates and conserve manatees using non-invasive methods regionally and locally, such as sonar [
5], aerial surveys using high-wing airplanes [
12,
13], and extensive acoustical monitoring [
9]. Consequently, we have devised digital signal processing and machine learning approaches for passive monitoring using bioacoustical recordings [
14]. Specifically, we have focused on understanding the acoustic properties of manatee vocalizations [
15]. This understanding has been translated into methods to facilitate the detection, identification, and counting of unique individuals using various machine learning techniques [
9,
10,
16,
17]. These methods have been used to implement an edge computing device for real-time acoustic detection [
18] and a data analysis platform [
19] to estimate the demographics of manatees in Panama. However, crewed aircraft surveys face significant logistic and cost-efficiency challenges due to the nature of tropical wetlands, particularly when taking aerial images of turbid brackish waters covered in dense aquatic vegetation [
5,
9].
The use of UAVs for marine wildlife research has gained popularity in recent years, encompassing both RGB and satellite spectral imagery [
20,
21]. As technological developments rapidly advance the versatility and functionality of affordable devices, their potential as a marine aerial survey tool has garnered attention for monitoring aquatic creatures such as humpback whales [
22], river dolphins [
23], and various manatee species [
24,
25], among others [
26]. For this purpose, machine learning models based on deep convolutional neural networks (DCNNs) have enabled automatic detection in UAV imagery, mostly using only the RGB channels.
In this regard, the work of Dujon et al. [
27], which examined how the morphology, spacing, behavior, and habitat of Australian seals, loggerhead sea turtles, and gannets affect CNN-based object detection, is a notable example. Similarly, the performance of convolutional object detection architectures has been tested for the aerial survey of whales in Chinese waters [
28] and sharks [
29].
Recent studies have used YOLOv8 for automatic object detection and segmentation, focusing on the preservation of wildlife on various scales, such as monitoring multiple endangered species [
30] or estimating the population of specific subspecies [
31]. However, for aerial images taken by drones, the detected objects are mostly small targets; therefore, the target scale changes considerably due to the influence of the aerial perspective [
32]. In this context, some authors have attempted to solve the detection scale problem by modifying the YOLOv8 architecture. Liu et al. [
33] replaced several components of the original architecture to improve model performance and generalizability. Similarly, Wang et al. [
32] added an attention mechanism to help the model focus on important information. Their multiscale feature fusion network significantly improves detection performance, especially for small objects as demonstrated by their experiments on the
VisDrone2019 dataset. Another approach [
34] focused on improving the model performance in varying lighting conditions [
35].
Moreover, Slicing Aided Hyper Inference (SAHI) is an emerging technique that enhances small-object detection by automating tile overlap during model inference. Tile overlap enables the detection of objects at tile edges that might otherwise be split across adjacent tiles. A recent study demonstrated that SAHI increased average precision by approximately 5.7% for three different predictors after model tuning [
36]. This tool improved detection in aerial UAV image analysis [
37,
38] and is one of the core techniques adopted in this study, in addition to fine-tuning YOLOv8 with its built-in augmentation capabilities.
In recent years, significant progress has occurred in the field of zero-shot learning (ZSL). This machine learning paradigm enables models to make inferences about classes for which no task-specific training has been performed. Unlike traditional methods, ZSL leverages shared knowledge or attributes between categories instead of relying on direct experience with each specific class [
39]. This began with the development of pre-trained large language models such as GPT-3 [
40], and later continued with the rise of models capable of combining visual and textual features such as CLIP (Contrastive Language–Image Pre-Training) [
41]. In this work, we employ the zero-shot capabilities of AltCLIP [
42], a vision-and-language model based on CLIP that was trained on millions of images and a web-crawled text corpus. Classification is performed by measuring the similarity between image embeddings and custom text prompts that define the five image classes in our study.
Although drones, or Unmanned Aerial Vehicles (UAVs), have not been extensively explored for manatee monitoring in Panama, our research addresses a single question: whether UAV images, together with a trained YOLOv8 detection model and a ZSL classification model, constitute a viable and effective method for detecting manatees in the Panama Canal Basin.
The main contribution of this study is the integration of advanced machine learning techniques, such as deep convolutional neural networks (YOLOv8), Slicing Aided Hyper Inference (SAHI), Zero-Shot Learning (ZSL), and Contrastive Language–Image Pre-Training (CLIP), to tackle the complex challenge of detecting and counting manatees in drone footage. This problem is particularly non-trivial due to the small size of the targets, their partial submersion, and the turbid waters of their habitat, such as those found in Panama’s wetlands. By combining these methods, our approach improves the detection accuracy and robustness, addressing key limitations in traditional aerial survey techniques.
2. Materials and Methods
2.1. Study Location
Gatun Lake, which covers approximately 436 km² (Figure 1), was created in 1913 as a water reservoir for the Panama Canal, which spans 80 km from the Atlantic to the Pacific [
43]. In 1913, an invasion of aquatic plants, including water hyacinth, was reported, and efforts were made to find natural ways to control its spread in the lake [
44]. In June 1964, nine Great Caribbean manatees (
Trichechus manatus) and one Amazonian manatee (
T. inunguis), neither of which is native to the area, were introduced into Gatun Lake to control aquatic plants and to help address the proliferation of malaria-transmitting mosquitoes for public health purposes. However, the project did not achieve its initial objectives [
11,
45].
According to the Public Health Chapter of the 1966 Panama Canal Company and Canal Zone Government Reports, “...aquatic vegetation in the Chagres River and certain areas of Gatun Lake will require more than a large herd of manatees for effective control”. It was concluded that the size and scope of the Canal Zone waters “...will not be effectively controlled with the use of manatees”. As a result, the animals were abandoned after escaping from the enclosure, and the authorities terminated the experiment.
They are now found in the waters of the Panama Canal [
11,
46,
47]. Reports also suggest that the manatee population has grown extensively [
48]. In 2020, a manatee sighting was recorded in the Panamanian Pacific [
11], confirming reports from several decades ago [
47].
This expanding population can harm the species itself and simultaneously pose a hazard to the normal functioning of the Panama Canal. Moreover, it can disturb the natural balance and threaten the native local fauna. Thus, it is of the utmost importance that non-invasive methods are used to assess the population that inhabits the Panama Canal Basin.
2.2. Data Collection
Gatun Lake, approximately 436 km², was divided into 44 polygons of varying size (ranging from 48 to 856 ha) to cover the entire lake. This division accounted for overflight restrictions imposed by the Panama Canal Authority on the main shipping channel and for the varying morphology of the lake shoreline and its bordering forests. The polygons were also designed to reduce flight time. The locations of these 44 polygons are shown in
Figure 1.
Each polygon was covered by preprogrammed flight lines, usually with a 25% forward and 40% lateral image overlap and speeds between 5 and 12 m/s. The gimbal angle was kept fixed throughout the flights. The flight altitude was maintained between 60 and 100 m, according to the specific flight area. Flights were conducted every 15 days to cover the entire area, depending on weather conditions (wind, rain), between 8:00 and 11:00 h, from February 2023 to April 2024. Two drones were used: a DJI Mavic 3 Enterprise (DJI, Shenzhen, China) equipped with a 20 MP camera and a DJI Matrice 200 (DJI, Shenzhen, China) with a 24 MP camera, each flown by its own licensed pilot. Photos were taken every 3–5 s along the flight path, with consecutive image overlaps of 20% (front) and 80% (side).
2.3. Data Organization
Two separate pilots obtained three image collections following a previously defined data collection protocol. The first two collections comprised 64,240 and 15,458 images for Pilots 1 and 2, respectively, and were used for training and testing purposes (with images chosen randomly in each set). A third collection of 22,336 images from the most recent Pilot 1 flights was set aside as unseen data to test the validity of the model. Original image resolutions were 3840 × 2160, 4000 × 3000, and 5280 × 3956 pixels for Pilot 1 and 9504 × 6336, 6016 × 4008, and 6016 × 3376 pixels for Pilot 2. After manually inspecting both image collections, we observed differences in quantity and quality under diverse conditions such as altitude, landscape, and visibility of the target animals. Taking these factors into consideration, the Pilot 1 collection was selected as the training dataset due to its higher number of images and less complex environments with more visible animals. The Pilot 2 collection was selected for testing, as it contained images of higher complexity (less visible animals in worse lighting conditions); this is ideal for testing the model's performance and detecting possible signs of overfitting.
A total of 65 flights were conducted, covering an approximate area of 26,678 hectares in 220.9 flight hours. Initially, the polygons were sampled biweekly until we found the areas most frequented by manatees, where we focused our subsequent periodic efforts.
From these efforts, 76 manatees were observed in the 57,332 images acquired, which were initially analyzed manually. Sixteen mother–calf pairs (1–3 pairs per day) were observed on nine sampling days, yielding 179 images across both pilots.
For object detection tasks, images of the same animals were arranged as a series of sequences (see
Figure 2) to avoid mixing similar images across the training, validation, and testing datasets. Pilot 1 images, consisting of 30 sequences, were used for training, while Pilot 2 images, which comprised fewer sequences, were used for testing.
On the other hand, for classification tasks, which did not require a training phase, a subset of 316 image crops was used to evaluate the performance of ZSL. This dataset comprised images of manatees from all sequences, manually selected to avoid similar instances of the same animals. It also included photos of logs and background images (see
Figure 2).
2.4. Detection Model
The YOLOv8 architecture comprises three primary components: the backbone, neck, and head. The backbone typically employs a deep convolutional neural network (DCNN) to extract feature maps from the input images. These feature maps are then passed to the neck, which utilizes pyramidal methods to enhance and refine features, capturing multiscale information crucial for detecting objects of various sizes. Finally, the processed features are fed into the head module, which predicts the bounding box coordinates of each detected object and the class probabilities.
YOLOv8 was chosen for its balance between performance, efficiency, and real-world applicability [
49]. More importantly, YOLOv8 has been shown to work well at detecting small targets in vast backgrounds, especially when coupled with Slicing Aided Hyper Inference (SAHI) [
50,
51,
52]. Thus, it is a perfect fit for the detection and classification of small, partially submerged manatees in complex aquatic environments such as the Panama Canal basin.
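As an illustration of how SAHI can wrap a detector for sliced inference, the following is a minimal sketch assuming the publicly available sahi and ultralytics packages; the weights path, image file, confidence threshold, and slice sizes are placeholders rather than the exact values used in this study.

```python
# Minimal sketch: sliced inference with SAHI wrapping a fine-tuned YOLOv8 model.
# Paths, thresholds, and slice sizes are illustrative assumptions.
from sahi import AutoDetectionModel
from sahi.predict import get_sliced_prediction

detection_model = AutoDetectionModel.from_pretrained(
    model_type="yolov8",
    model_path="runs/detect/train/weights/best.pt",  # hypothetical weights path
    confidence_threshold=0.25,
    device="cuda:0",
)

result = get_sliced_prediction(
    "flight_frame.jpg",          # hypothetical full-resolution UAV frame
    detection_model,
    slice_height=640,
    slice_width=640,
    overlap_height_ratio=0.2,    # 20% tile overlap keeps objects lying on tile edges
    overlap_width_ratio=0.2,
)

for pred in result.object_prediction_list:
    # print class name, confidence, and bounding box corners
    print(pred.category.name, pred.score.value,
          pred.bbox.minx, pred.bbox.miny, pred.bbox.maxx, pred.bbox.maxy)
```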
Although newer YOLO versions may offer improvements [
53,
54], one of the future objectives of this study is to potentially implement a customized embedded system, and for this reason, YOLOv8 remains a stable, well-supported, and computationally feasible choice.
This architecture allows YOLO models to achieve a balance between accuracy and speed, making them highly effective for real-time object detection tasks. The network is built from the following essential blocks:
CBS layers: These are compound layers of convolutional filters, batch normalization, and the SiLU activation function. They perform a convolution operation that extracts image features by applying filters to the input data.
C2f layers: They are designed to enhance feature extraction and fusion. They do this by splitting the input features into two parts: one is processed through several convolutional layers, while the other remains unchanged. Then, these parts are merged to allow the network to retain original information while adding refined features.
SPPF layer: This is a fast variant of the Spatial Pyramid Pooling (SPP) layer. It pools features at different sizes to capture details from various levels and then combines these pooled features into a comprehensive representation. This enhances the model’s ability to recognize objects of different sizes while maintaining computational efficiency.
Concatenation layer: This is used to combine several feature maps along a specified dimension (usually the channel dimension).
Upsampling layer: This is used to increase the spatial resolution of feature maps.
A visual description of the network is provided in
Figure 3.
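To make the CBS building block concrete, the following is a minimal PyTorch sketch of a Conv–BatchNorm–SiLU unit; it mirrors the idea described above rather than reproducing the exact Ultralytics implementation.

```python
# Illustrative PyTorch sketch of a CBS block (Conv + BatchNorm + SiLU).
import torch
import torch.nn as nn

class CBS(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel: int = 3, stride: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel, stride,
                              padding=kernel // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)   # per-channel normalization of activations
        self.act = nn.SiLU()               # SiLU (swish) activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.conv(x)))

# Example: a 640x640 RGB crop passed through one CBS block with stride 2
x = torch.randn(1, 3, 640, 640)
y = CBS(3, 32, kernel=3, stride=2)(x)
print(y.shape)  # torch.Size([1, 32, 320, 320])
```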
2.5. Training and Validation Parameters
For all experiments, fine-tuning was performed on YOLOv8s, the small variant of the model, using default learning-rate and optimizer parameters. Training and validation samples were generated as 640 × 640 crops (with 20% horizontal and vertical overlap) using SAHI. All samples were labeled using the YOLO bounding box format with two target classes: class “manatee” and class “logs”.
Figure 4 provides a clear example of the two classes co-existing in a single image.
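As a concrete illustration of this slicing step, the sketch below uses the sahi package's slice_image utility to cut a full-resolution frame into 640 × 640 crops with 20% overlap; the file name and output directory are hypothetical.

```python
# Sketch: slicing a full-resolution UAV frame into 640x640 training crops
# with 20% overlap using SAHI. File names and paths are illustrative.
from sahi.slicing import slice_image

slice_image(
    image="flight_frame.jpg",          # hypothetical full-resolution frame
    output_file_name="flight_frame",
    output_dir="dataset/train/images",
    slice_height=640,
    slice_width=640,
    overlap_height_ratio=0.2,          # 20% vertical overlap between adjacent crops
    overlap_width_ratio=0.2,           # 20% horizontal overlap
)
# The crops (and their pixel offsets) are written to output_dir;
# COCO-style annotations can be sliced analogously with sahi.slicing.slice_coco.
```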
Figure 3. Simplified graphical representation of the YOLOv8 architecture.
Figure 5 provides examples of the “logs” class, which also contains the undesired background objects and landscapes that can be found in the dataset.
Model parameters include online data augmentation. These techniques, native to YOLOv8, apply image and color-space transformations to random images that are spliced into a composite (mosaic) image and fed directly into the neural network. This technique is inherited from an earlier version, YOLOv4 [
55], allowing the identification of objects at a smaller scale than usual while significantly reducing the need for a large mini-batch size.
Figure 6 provides examples of the data augmentation transformations applied during training.
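For reference, the following is a minimal sketch of how such a fine-tuning run with Ultralytics' built-in online augmentation could be launched; the dataset configuration file and the specific hyperparameter values shown are illustrative assumptions, not the exact settings used here.

```python
# Sketch of fine-tuning YOLOv8s with Ultralytics' online augmentation
# (mosaic, HSV jitter, flips). Dataset YAML and hyperparameter values are assumptions.
from ultralytics import YOLO

model = YOLO("yolov8s.pt")              # pretrained small variant
model.train(
    data="manatee_dataset.yaml",        # hypothetical dataset config (train/val paths, 2 classes)
    imgsz=640,
    epochs=100,
    mosaic=1.0,                         # probability of composing 4 random images into a mosaic
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,  # color-space jitter
    fliplr=0.5,                         # horizontal flip probability
)
```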
2.6. Zero-Shot Classification Model
Once manatees were detected by YOLOv8, a method was designed to count individuals and post-classify images using the zero-shot learning (ZSL) approach. This method addresses an important issue that arises when counting individuals, adult–calf pairs, or larger groups in the current dataset: the detection model can often make mistakes, as it is trained on limited samples and highly imbalanced classes.
ZSL is implemented as a similarity learning algorithm. That is, a large pre-trained model (such as AltCLIP) enables image and text features to coexist in the same dimensional space (i.e., embeddings) and then calculates similarity using the cosine distance [
42].
AltCLIP represents a novel integration in this context, as it addresses the challenges of detecting small, partially submerged manatee groups in complex environments such as the Panama Canal Basin.
Figure 7 provides a visual description of the AltCLIP model architecture used for image processing.
The model creates an image embedding using the CLIP encoder. A similar process is performed with a multi-language encoder to produce another embedding, which is then projected to co-exist in the same embedding space. Contrastive learning is the core of making text and image embeddings coexist, as the model was previously trained using pairs of matching (positive) and non-matching (negative) image–text examples. Therefore, it uses the acquired knowledge from general classes to adapt to the downstream task (classification of different manatee classes) [
42]. Finally, after calculating cosine similarity, the resulting scores are converted into a probability distribution to find the most probable text class for each input image.
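The following is a minimal sketch of this zero-shot pipeline using the Hugging Face implementation of AltCLIP; the prompts are shortened stand-ins for the descriptions in Table 1, and the input file name is hypothetical.

```python
# Minimal sketch of zero-shot classification with AltCLIP via Hugging Face Transformers.
import torch
from PIL import Image
from transformers import AltCLIPModel, AltCLIPProcessor

model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")

prompts = [
    "a single manatee with a fish-like shape seen from above",  # illustrative, not the exact Table 1 text
    "a mother manatee with a smaller calf",
    "a group of several manatees",
    "a floating log in the water",
    "water surface background",
]
image = Image.open("detection_crop.jpg")   # hypothetical crop from the detector

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# scaled cosine similarities (logits) converted into a probability distribution
probs = outputs.logits_per_image.softmax(dim=1).squeeze()
print(prompts[int(probs.argmax())], float(probs.max()))
```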
In this way, the model can classify without needing labeled training data. The model requires the following components to perform the classification process:
Image pre-processing: To normalize the appearance of manatees in diverse environments and lighting conditions, we employed YCbCr decomposition. This technique separates the luminance (Y) component of an image from the chrominance components (Cb and Cr) (see
Figure 8). It has been demonstrated that isolating the Y channel is crucial for separating luminance and reflectance, which helps to reduce distortions and normalize the effects of water [
56]. We employed the same technique, using the Y channel as the primary input for the model, preceded by normalization and resizing to 224 × 224 pixels (the input size for AltCLIP); a sketch of this step is given after the list below.
Custom class prompts:
Table 1 shows the descriptive text used to define the classes. Since large models are trained with thousands of general classes, we relied on the literal word “manatee” together with descriptive text focused on morphological features, such as the distinctive “fish” shape of manatees when observed from a high perspective. This proved to be the most effective way to guide the model toward the features of the classes. In contrast, more literal wording was used to describe the “background” and “logs” classes.
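The sketch below illustrates the luminance-based pre-processing step referred to above, assuming OpenCV's YCrCb conversion and a simple normalization; the exact normalization used in the study may differ, and replicating the single Y channel to three channels is only an assumption about how it is fed to the encoder.

```python
# Sketch of the luminance-based pre-processing: convert to YCrCb, keep the Y
# (luminance) channel, and resize to AltCLIP's 224x224 input. File name is hypothetical.
import cv2
import numpy as np

bgr = cv2.imread("detection_crop.jpg")              # crop produced by the detector
ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)      # separate luminance from chrominance
y_channel = ycrcb[:, :, 0]                          # Y = luminance
y_resized = cv2.resize(y_channel, (224, 224))
y_norm = (y_resized - y_resized.mean()) / (y_resized.std() + 1e-6)  # simple normalization
model_input = np.repeat(y_norm[:, :, None], 3, axis=2)  # replicate to 3 channels (assumption)
```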
2.7. Model Evaluation Metrics
As with any classification task, the purpose is to build a model that learns the features that define each class. In our case, the images were labeled as belonging to two classes, manatee and logs, and the rest were treated as background. Logs were considered an additional class since they are widely present in many locations of the Panama Canal during drought periods, when lower water levels make them visible in the landscape [
57,
58]. More importantly for the model, in some cases the silhouettes of logs resembled manatees and were a source of classification errors during the first training runs.
Classic accuracy metrics (precision and recall) were used to evaluate model performance on these two classes, together with mean average precision (mAP) and intersection over union (IoU), which are briefly explained as follows:
Intersection over Union (IoU): This is a number that indicates the overlap between the predicted bounding box coordinates and those of the ground truth (where the class actually is in the image). A higher IoU indicates that the coordinates align more closely, i.e., a better prediction. This is important, as the IoU threshold value helps us decide whether a prediction is a True Positive, False Positive, or False Negative. It is calculated with Equation (1):

$$\mathrm{IoU} = \frac{\text{Area of Intersection}}{\text{Area of Union}} \quad (1)$$

In this equation, the area of intersection is the region where the predicted and ground truth boxes overlap, while the area of union is the total area covered by both boxes combined, counting the overlapping region only once.
Mean average precision (mAP): This compares the predicted class associated with each region of interest against the ground truth in the original image labels and returns a score. The higher the score, the more accurately the model detects the object in question. It is calculated with Equation (2):

$$\mathrm{AP} = \sum_{n}\left(R_n - R_{n-1}\right)P_n, \qquad \mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{AP}_i \quad (2)$$

where $P_n$ and $R_n$ are the precision and recall at the $n$-th threshold and $N$ is the number of classes. mAP is computed over detections above a minimum confidence threshold (the minimum probability for an output detection to be considered) and is reported at given IoU thresholds: detection models are usually evaluated at a 50% IoU threshold (mAP50) and averaged over IoU thresholds between 50% and 95% (mAP50:95).
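As a small worked example of the summation in Equation (2), the following computes an average precision value from illustrative (made-up) precision–recall points:

```python
# Illustrative AP computation following Equation (2); the arrays are made-up values.
import numpy as np

recall    = np.array([0.0, 0.2, 0.4, 0.6, 0.8])
precision = np.array([1.0, 0.9, 0.85, 0.7, 0.6])

ap = np.sum((recall[1:] - recall[:-1]) * precision[1:])
print(round(float(ap), 3))  # 0.61
```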
Fitness: This metric is used in YOLO models to compare the performance of different models as a weighted combination of mean average precision. It assigns a 10% weight to mAP50 and 90% to mAP50:95 [
59]. It is calculated using Equation (3):

$$\text{fitness} = 0.1 \cdot \mathrm{mAP}_{50} + 0.9 \cdot \mathrm{mAP}_{50:95} \quad (3)$$

This equation is used to evaluate the training of all YOLOv8 models under a cross-validation strategy.
Top-k accuracy: This is employed to evaluate classification models by measuring the proportion of instances where the true label is among the top k predicted labels ranked by the model’s confidence. It generalizes traditional accuracy, in which only the top prediction counts, by considering multiple plausible predictions for each image. This metric helps to understand the possible overlaps between classes in zero-shot classification.
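As an illustration, top-k accuracy can be computed with scikit-learn's top_k_accuracy_score; the labels and probability rows below are made-up values, not study results.

```python
# Sketch of top-k accuracy for a zero-shot classifier; values are illustrative only.
import numpy as np
from sklearn.metrics import top_k_accuracy_score

y_true = np.array([0, 2, 1, 0])          # true class indices for four crops
y_prob = np.array([                      # per-class probabilities from the classifier
    [0.60, 0.30, 0.10],
    [0.20, 0.30, 0.50],
    [0.40, 0.35, 0.25],
    [0.50, 0.10, 0.40],
])
print(top_k_accuracy_score(y_true, y_prob, k=2, labels=[0, 1, 2]))  # 1.0 for this toy example
```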
2.8. Experimental Setup
The experiments were performed with the following setup:
Detection using YOLOv8 (Experiment 1): Fine-tuning was performed using the training subset described in
Section 2.3; a 10-fold random cross-validation experiment was conducted. Afterwards, using the model with the highest fitness value (
Section 2.7), the performance on the test subset was evaluated to determine the degree of learning of the model and its effectiveness on unseen data.
Detection of small objects in full-resolution images using SAHI (Experiment 2): Using the top-performing model from Experiment 1, we conducted inference on full-resolution sequences to evaluate the model’s capability in a real-world scenario. The goal was to determine if manatees, as small objects in large-scale imagery, could be reliably detected when dealing with the complexities of high-altitude aerial frames.
Zero-shot classification of manatees (Experiment 3): Following the detection performance evaluation, we tested the ability of AltCLIP to identify individual manatees without training data. This was performed using the previously used training and testing subsets from Experiment 1, using the image processing techniques and text prompts outlined in
Section 2.6. The goal was to assess the effectiveness of zero-shot learning on counting individual animals and identifying key demographic groupings such as calf–mother pairs.
2.9. Computational Hardware and Software
All experiments were performed on a custom PC with an AMD Ryzen 9 5950X CPU, an EVGA RTX 3080 GPU with 10 GB of VRAM, and 128 GB of RAM, running Ubuntu 22.04. Python (3.9.17) and libraries for image and data pre-processing were used, among them Numpy (1.26.4), Scikit-learn (1.0.2), Librosa (0.10.1), OpenCV (4.8.0.76), and Scipy (1.11.2).
As stated earlier, a specialized version of YOLO was used; the YOLOv8s model can be obtained from Ultralytics [
60]. Moreover, to analyze the convolutional layers of the neural network in detail, and to have a clear understanding of the internals of the neural networks, representation software (GradCAM [
61]) was adapted to work with Pytorch and YOLOv8. Finally, the AltCLIP model and its pipeline and pre-processing functions were used from the
Huggingface Transformers library [
62].
4. Discussion
Our research addresses the question of whether drone images from the Panama Canal Basin can be used to train YOLOv8 detection and ZSL classification models. The results regarding YOLOv8’s capabilities to detect manatees indicate that the model can identify most instances of manatees in the monitored areas of the Panama Canal. As shown in Experiment 1, the objective was to assess object detection performance under ideal conditions using small, low-resolution images in which most manatees were easily visualized. The loss-per-epoch plots shown in
Figure 9 and
Figure 10 indicate that the model struggled during the initial epochs due to fluctuations in error across specific batches of images; this could be attributed to the high variability among some manatee sequences. However, the model produced stable metrics at the end of the training cycle, reaching high precision (93%) on known data across all classes (
Table 2).
Looking at the performance of the model on unseen data,
Table 2 shows a very low recall (43.8%) in the test set, indicating the presence of false negatives, i.e., undetected manatees (
Figure 15A). Moreover, the lower precision value (86.2%) indicates a higher number of false positives (
Figure 15B). This could be due to environmental factors, such as the abundance of undesirable objects (see
Figure 16). These objects can introduce inference errors, such as underwater soil that creates patterns that are difficult to assess by the model (shown in
Figure 16A), or water surface reflections and underwater logs, which could be mistaken for a floating manatee (seen in
Figure 16B,C).
Experiment 2 reveals insights into the actual accuracy of YOLOv8 in a more complex setup, specifically, high-altitude shots.
Table 4 shows that manatees were detected in 28 of 30 sequences. It is interesting to note that most of the undetected sequences involved higher-altitude shots in complex scenery. Examples of this complexity are shown in
Figure 17A, which depicts a calf-mother pair that can barely be identified due to the high altitude (199 m). In contrast, a clear case of an image where manatees are present in a complex environment is shown in
Figure 17B, illustrating two manatees swimming in a feeding spot, but their characteristic fish shape is barely noticeable. It is also worth mentioning that this image is unique among all datasets.
Manatees were detected in 11 of the 15 sequences in the test set.
Figure 17C is an example of easy detection, that is, images that are less complicated for the model to interpret, as they differ the most from the training set samples. However, low performance is expected in the detection of manatees in some images, because the model struggles to fully learn the features of unseen data (primarily because the test set comprises more complex shots); one of these more complex examples is sequence 10 (shown in
Figure 17D). Given these results, it is safe to conclude that the model can detect most manatees in favorable lighting conditions. However, lower performance is expected at high altitudes or in very complicated shots with unfavorable lighting or environmental conditions, a common issue with drone footage.
The ZSL classification scheme uses text prompts with AltCLIP. The metrics shown in
Table 6 indicate that the model can identify individual manatees, logs, and background images, all with high F1-scores. However, it struggles to correctly count mother–calf pairs and groupings of more than two manatees. The following factors could explain the lower performance when trying to identify multiple manatees:
In some scenarios, the precise shapes and morphological characteristics of groups of multiple manatees are not visible, as shown in
Figure 18. The characteristic fish-like shape can be evident (
Figure 18C) or less clear when viewed from higher altitudes (
Figure 18A), leading to classification errors in which three classes (logs, mother–calf pairs, and groups of manatees) have very similar detection probabilities (Table 7).
In some cases, when a manatee is not visible due to lighting or environmental conditions (as seen in
Figure 18B), adjacent logs could be mistaken for manatees, owing to the critical condition established in the text prompt for the mother–calf class, namely that one animal should be smaller than the other (see
Table 1).
There is a clear overlap between the classes, especially when counting multiple animals. This is reflected in the output probabilities (
Figure 18C and
Table 7), where both mother–calf pairs and many manatees are plausible classes.
5. Conclusions
This work assessed the abilities of YOLOv8 in the detection of manatees under various environmental conditions and geographical locations. The results showed acceptable output metrics, with mean accuracies between 70% and 90%. This indicates that, despite the high variability in the scenery and appearance of manatees, it is possible to train a model for manatee detection. Further exploration of data-augmentation techniques could enhance the model by evaluating different combinations of image transformation variables, such as perspective, rotation, and color.
Exploring more complex model architectures, such as YOLOv9 [
63] or YOLO-NAS [
64], along with various hyperparameter configurations, may yield better results. However, we advocate the use of YOLOv8 due to its direct compatibility with SAHI and its ability to improve small-object detection performance. The size of the model must also be taken into account, as it is a crucial factor in the implementation of a drone-portable real-time system [
18]. Despite class overlap and the complexities of counting multiple manatees of different sizes, AltCLIP achieves a high top-2 accuracy close to 90% (
Table 6). It is also worth noting the accessibility of zero-shot learning: it does not require training procedures and allows users to focus on feature pre-processing rather than on computational constraints. This is especially convenient in situations of low data availability, where adding more high-quality data is very difficult and time consuming. Nevertheless, AltCLIP can struggle to distinguish multiple manatees or mother–calf pairs due to similar morphological features and overlapping “visual cues”. According to Metzen et al. [
65], a method known as AutoCLIP could be used to automatically adjust the importance of each text prompt based on how well it matches the image. This could help AltCLIP better differentiate between visually similar classes by emphasizing key distinguishing features (e.g., “two manatees swimming together” vs. “a single manatee”) at inference time.
Furthermore, experimenting with different text features could produce better results. Using different languages introduces another level of complexity when describing the scenery and the manatee morphological characteristics, as contextual descriptions vary in languages such as Spanish. Finally, more advanced models, including large language models (LLMs), vision–language models (VLMs), and multimodal models, could be explored [
66,
67,
68,
69].
Looking ahead, future research should prioritize refining detection models to improve accuracy, especially in identifying adult–calf pairs, which are vital for understanding reproductive dynamics. In addition, recognizing manatees traveling in groups is critical for precise population assessments and for projecting future trends and the overall population in this basin. Moreover, refining the process of determining whether the same individual appears in multiple images, and avoiding duplicate counts through temporal tracking or spatial–temporal analysis techniques, are essential for accurate estimations [
49,
70].
This study is of particular importance, as it provides the first comprehensive insight into the manatee population in the Panama Canal Basin using a well-established computer vision methodology for detecting and classifying manatees. This methodology could be used to track individual movements, assess habitat use, and gauge the impacts of wetland deforestation and land-use change occurring in the basin. Finally, the work sets the stage for future ecological research and conservation strategies for this IUCN-listed endangered species in Panama.