1. Introduction
Posidonia oceanica is an endemic seagrass species of the Mediterranean Sea that forms complex marine meadows and provides significant ecosystem services for ocean life, such as a suitable habitat and nursery grounds, water purification and oxygenation, carbon sequestration, and coastal protection [1,2].
Posidonia meadows were identified as a priority habitat type for conservation and management, aiming to reduce the negative impacts of increasing coastal urbanization, the uncontrolled use of destructive fishing methods, climate change, and the invasion of alien species (Habitats Directive 92/43/EEC, 1992). Given the documented lack of relevant information in parts of the Mediterranean Sea and an estimated regression of meadows in the order of 34% in the last 50 years [3], the effectiveness of protection and restoration measures has to be regularly monitored using cost-, time- and effort-effective tools in order to detect changes in the extent and distribution of meadows in a timely manner. At the same time, monitoring the Posidonia meadows in shallow coastal areas is very important as an indicator of the habitat’s health, but is often subject to technical and operational constraints. The formation of dead Posidonia matte (i.e., dead rhizomes and roots, dead leaves, and drift epibionts) is also an important habitat component, functioning as an additional carbon sink, a food source for detritus feeders, and a natural banquette reducing the impact of wave energy [2]. Therefore, the extent and variability of areas where dead Posidonia matte is aggregated is also a useful parameter that is worth monitoring using habitat mapping tools.
Habitat mapping is defined as the identification, delineation, and documentation of the spatial distribution of habitats in a specific geographical area [4]. Considering benthic habitat mapping, a variety of methods exist for data acquisition and analysis, including remote sensing via optical or acoustic means and field surveys by scientific divers [5], statistical methods for data integration/classification, and Geographic Information Systems (GIS) for spatial analysis and map production [4]. Methodologies based on the use of remote sensing for mapping the habitat of Posidonia meadows have been implemented using both optical [6,7,8] and acoustic [9,10,11,12] means, often accompanied by field surveys by scientific divers or some other ground-truthing (GT) approach. As expected, each family of methods has some advantages and some limitations [5,13], and as a result, methods combining both acoustic and optical techniques have emerged [14,15,16,17].
Advances in robotics, aerial drones, autonomous vehicles, and AI-driven analytics are providing marine science with powerful tools for more efficient, accurate, and large-scale data collection [18,19]. These technologies improve ocean monitoring and analysis, minimize the need for human intervention in remote or challenging environments, and significantly reduce operational costs. Robotics and autonomous systems contribute to high-resolution data collection [20], while AI facilitates data integration, pattern recognition, and predictive modeling, making complex ocean data more accessible and actionable [21]. The combined use of these technologies presents both opportunities and challenges for the development of novel sampling and analysis strategies.
Methodologies appropriate for coastal applications are preferable, if not necessary, for the purpose of mapping the habitat of Posidonia meadows. Small vessels, such as autonomous surface vehicles (ASVs), allow for better maneuverability in the coastal topography and can more easily traverse shallow waters [8,22]. In an alternative approach, the use of unmanned aerial vehicles (UAVs) with high-performance optics provides a low-cost method to retrieve habitat information at very high fidelity [23]. However, this method is also constrained by weather conditions [23] and, more importantly, has a limited range of application with respect to bottom depth [24,25,26], restricting its use to depths shallower than the full expected range of the potential habitat of Posidonia meadows.
The current study, performed as part of the project NAUTILOS (Horizon 2020; GA 101000825), aims to enhance temporal regularity and spatial resolution in marine monitoring and takes into account the Intergovernmental Oceanographic Commission Criteria and Guidelines on the Transfer of Marine Technology (UN SDG 14). More specifically, the study aims to develop an integrative method for mapping Posidonia meadows by combining optical data collected by a UAV, acoustic data collected by a portable MBES mounted on an ASV, and visual census (VC) datasets from scientific divers. Following data acquisition, the study employs a machine learning approach for the habitat mapping of Posidonia meadows. Independent classifiers for the optical and acoustic data are used in a transfer-learning scheme, where the high-fidelity photogrammetry dataset is utilized to improve the performance of a habitat classifier based solely on acoustic information. The advantages of this approach are twofold: (a) the methodology produces a model based on portable-MBES data that performs on par with a photogrammetric classifier, and (b) following an initial training process, the acoustic classifier can operate based only on MBES data, in conditions beyond the capability of UAV optics, considering mainly the bottom depth but also the dynamic optical characteristics of coastal waters. The study focuses on maximizing the benefits of each component, while critically evaluating the limitations of this hybrid approach.
2. Materials and Methods
2.1. Study Area
The survey operations were carried out in the Gournes area (35.33549, 25.28038), near the HCMR infrastructures in Heraklion, Crete, Greece (Figure 1).
2.2. Data Acquisition and Ground-Truthing
The field operations were carried out between the 20th and 26th of October 2023 and included (a) a UAV-based photogrammetry survey, (b) an ASV-mounted MBES survey, and (c) a VC conducted by scientific divers for GT. Due to operational challenges caused by interference between the radio communication systems used by the UAV, ASV, and their respective ground control stations (GCS), the UAV survey had to be terminated early. As a result, the original dataset presented an incomplete overlap between the UAV photogrammetry and the MBES survey polygons. To address this limitation, an additional aerial survey was conducted on 4 July 2024. A detailed manual comparison of the photogrammetry datasets from the two survey dates revealed no significant variations in habitat conditions within the overlapping mosaic areas. Consequently, the more recent dataset was selected for analysis to maximize the spatial intersection between the MBES and photogrammetry mosaics. This approach also allowed for the optimal integration of all GT data collected during the study.
The UAV utilized in this study was the DJI Mavic 2 Pro quadcopter, which has a takeoff weight of 900 g. The UAV was equipped with a high-performance Hasselblad L1D-20c camera mounted on a fully stabilized three-axis gimbal, ensuring superior image stability and quality. The camera features a 1-inch CMOS RGB image sensor and a 35 mm lens with a maximum aperture of f/2.8, enabling excellent light sensitivity and image clarity. To conduct the aerial survey, the DJI Pilot GCS software ver. 1.14 was employed, allowing for the precise execution of flight paths along 22 parallel transects. These transects were evenly spaced at 20 m intervals to ensure comprehensive area coverage. The UAV maintained a consistent flight height of 50 m above mean sea level and operated at a speed of 5 m/s. These parameters were selected to achieve a 70% image overlap between adjacent samples, which is critical for robust image stitching and analysis. Images were captured at 5 m intervals along the transects using a time-interval shot mode. The camera settings included an auto white balance configuration and shutter priority at 1/400 s, chosen to minimize motion blur and optimize image sharpness under varying lighting conditions. The survey yielded a total of 520 high-resolution RGB images, each with a resolution of 5472 × 3648 pixels (20 MP). These images formed the basis of the final dataset for subsequent analysis, providing detailed spatial and spectral information for the study area.
For the MBES survey, CEiiA’s ORCA ASV, designed for inland and coastal applications, was utilized. ORCA is a 3.4 m twin-hull ASV equipped with a 240 kHz Imagenex DT101Xi MBES, which integrates a motion reference unit and a surface sound velocity sensor. The DT101Xi was interfaced to a Differential Global Positioning System (D-GPS) module which supported Real-Time Kinematic corrections via a mobile network connection. For data acquisition, the DT101Xi was configured to utilize all 480 beams (effective beam width 0.75 degrees) with a sector size of 120 × 3 degrees, and the transmission range was set manually at 10 m. To ensure accurate depth measurements, a Valeport SVP sound velocity sensor was also manually deployed at 21 locations within the survey area from onboard the HCMR RHIB “IOLKOS”. The acquired sound speed profiles were subsequently applied during the refraction correction of the MBES data.
The HCMR Scientific Diving Team performed GT operations in the study area, utilizing a combination of geotagged underwater photographs and video recordings. This methodology provided high-resolution spatial and visual data to complement other survey techniques. The divers were equipped with a surface D-GPS unit, an underwater camera (Sony RX100V), and an underwater video system (GoPro ver11), ensuring precise documentation of the surveyed habitats. At the test site, the team performed both horizontal and vertical VC transects to systematically document the underwater environment. One transect was strategically selected for detailed analysis. Along this transect, photographs were captured at predetermined intervals, targeting a diverse range of locations and habitats to ensure comprehensive coverage. Over the course of a 65 min dive, the divers successfully captured a total of 104 high-quality photographs, offering a robust dataset for subsequent analysis (Figure 2). These photographs not only provide valuable insights into the biodiversity and habitat complexity of the area but also serve as a crucial validation tool for remote sensing and mapping efforts.
Through VC, three primary habitat types were identified: sand (soft sediment), dead matte, and P. oceanica meadows. Additionally, at very shallow depths (<1 m), scattered coastal reefs were observed. However, this habitat type was excluded from the analysis as it did not overlap with the ASV survey area. To establish GT points from the VC data, all photographs were geotagged using the free software GEOSetter ver. 3.4.16 (https://geosetter.de; accessed on 1 March 2023). To ensure precise geotagging, the internal clock of the underwater camera was synchronized with the portable D-GPS time. GEOSetter then assigned geotags to the photographs by matching their timestamps with the corresponding D-GPS tracking data.
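As an illustration of the timestamp-matching step, the sketch below linearly interpolates a D-GPS track at photo capture times; GEOSetter performs an equivalent match internally, and the track values and array names here are purely hypothetical stand-ins:

```python
import numpy as np

# Hypothetical D-GPS track samples (seconds since dive start, lat, lon).
track_t   = np.array([0.0, 60.0, 120.0])
track_lat = np.array([35.3354, 35.3356, 35.3358])
track_lon = np.array([25.2803, 25.2805, 25.2807])

# Photo timestamps, synchronized to the D-GPS clock as described above.
photo_t = np.array([30.0, 90.0])

# Linear interpolation of the track position at each photo timestamp.
photo_lat = np.interp(photo_t, track_t, track_lat)
photo_lon = np.interp(photo_t, track_t, track_lon)
```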
2.3. Data Processing
The UAV-based imagery was processed using Pix4Dmapper ver. 4.4.12. With this SfM (Structure from Motion) software package, 3D models and 2D raster products can be generated in a fully automated five-step process, comprising (i) alignment of the photographs, (ii) calculation of a sparse point cloud, (iii) calculation of a dense 3D point cloud, (iv) polygonal mesh model generation and texture mapping, and (v) generation of Digital Surface Models (DSMs) and ortho-rectification of the imagery [27]. Firstly, images were aligned with the accuracy parameter set to ‘high’. After the photo-alignment, this initial bundle adjustment created sparse point clouds from overlapping digital images. The sparse point clouds included the position and orientation of each camera and the 3D coordinates of all image features. The internal camera geometry was modeled through self-calibration during the bundle adjustment [28]. Subsequently, dense point clouds were built based on multi-view stereopsis algorithms with high quality and mild depth filtering. After filtering the dense point clouds according to confidence (points seen in fewer than three views were removed), these were used for producing polygonal meshes and DSMs using Inverse Distance Weighting interpolation. Finally, the DSMs were used to generate ortho-rectified 8-bit RGB photomosaics of the submerged habitats.
Raw MBES data collected by the onboard DT101Xi software ver. 1.02.08 were reprocessed with the Autoclean software ver. 2023 3.1.0 by BeamWorx in order to incorporate the sound velocity profile information, improve the positioning of individual acoustic bottom samples, and apply filtering techniques to reduce noise and remove potential artifacts and outliers from the dataset. The resulting georeferenced mosaics with a resolution of 1 m × 1 m for the bottom acoustic backscatter (BS) intensity (provided as an 8-bit integer) and depth were exported in GeoTIFF format. Each mosaic extends over a total area of 74,650 m2 within an enclosing rectangle of 332 m by 329 m in the X (longitude) and Y (latitude) axes, respectively. Bathymetry exhibits small variations throughout the study area, with an average depth of 5.3 m and a st. dev. of 1.5 m. Regardless, the depth mosaic was not used in the rest of the analysis to avoid producing a depth-locked model.
For a given acoustic pulse frequency and incidence angle, the level of BS returned by the bottom’s surface mainly depends on its physical properties, i.e., its acoustic hardness and spatial roughness, and the heterogeneity of the sediment [29]. Apart from geological formations, bottom-associated flora and fauna can also be a source of heterogeneity, typically increasing variability in the observed scattering [30]. For this reason, in addition to the MBES acoustic BS mosaic, secondary features based on the spatial structure of the BS and associated statistics were also extracted. In particular, the (a) spatial gradient, (b) roughness (degree of irregularity, calculated as the largest inter-cell difference between a central pixel and its surrounding cells), and (c) spatial mean, variance, min, and max for 2, 5, 10, and 20 m radii were produced directly from the original BS mosaic and exported as separate GeoTIFF files. The BS gradient and roughness were computed in QGIS version 3.28.15 [31], while the other BS statistics were calculated in Python v 3.11.7 [32].
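As a rough sketch of how such secondary features can be derived from a BS raster (the actual gradient and roughness layers were produced in QGIS, so the helper functions below are hypothetical stand-ins built on scipy’s image filters):

```python
import numpy as np
from scipy import ndimage

def bs_gradient(bs):
    """Magnitude of the spatial gradient of the BS mosaic."""
    gy, gx = np.gradient(bs.astype(float))
    return np.hypot(gx, gy)

def bs_roughness(bs):
    """Roughness: largest absolute difference between the central cell
    and any cell in its 3x3 neighbourhood (GDAL-style definition)."""
    def rough(window):
        center = window[len(window) // 2]
        return np.max(np.abs(window - center))
    return ndimage.generic_filter(bs.astype(float), rough, size=3)

def bs_window_stats(bs, radius_px):
    """Moving-window mean, variance, min and max within a square window
    approximating the given radius (in pixels; cells are 1 m here)."""
    size = 2 * radius_px + 1
    f = bs.astype(float)
    mean = ndimage.uniform_filter(f, size)
    var = ndimage.uniform_filter(f ** 2, size) - mean ** 2
    mn = ndimage.minimum_filter(f, size)
    mx = ndimage.maximum_filter(f, size)
    return mean, var, mn, mx
```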
Initial testing with the VC GT data and the photogrammetry mosaics produced models with unimpressive accuracy scores. This was attributed to the spatial distribution of the GT points, which were close to each other along the dive transect (Figure 2 and Figure 3), providing poor and inhomogeneous spatial coverage of the surveyed area and an unbalanced representation of the three habitat classes. However, due to the clarity of the photogrammetry mosaic and the uncomplicated habitat structure of the studied area, habitat identification at the required class level was also possible via direct expert annotation of the image, as verified by comparisons at the GT data locations. Therefore, it was decided to enrich the VC GT dataset with additional points (Figure 3) to provide a more uniform coverage of the surveyed area and a balanced representation of all classes. This work, carried out by a member of the HCMR Scientific Diving Team, resulted in an augmented GT dataset including the manually annotated habitat characterizations and a subsample of the original GT VC data. Unless otherwise specified, for the rest of the manuscript “GT” will refer to the augmented GT dataset.
2.4. Habitat Classification Models
The main habitat classification models implemented in this work include (a) a supervised classifier based on the georeferenced photogrammetry (UAV/optical) dataset and trained with the GT data, here called the “teacher” model, (b) a supervised classifier based only on the BS-derived (ASV/acoustic) dataset and trained with predictions from the teacher model, here called the “student” model, (c) a supervised classifier based on the BS-derived (ASV/acoustic) dataset and trained with the GT data, here called the “direct” model, implemented for method validation purposes, and (d) an unsupervised classifier based on the BS-derived (ASV/acoustic) dataset, also used for method validation.
The teacher and direct models can be considered single-source models, the former optical and the latter acoustic, both trained with the GT data. The student model is a hybrid model in the sense that, although it only uses acoustic data for prediction, it was produced via training with the teacher model’s predictions. At the same time, the model was developed so that, following the training procedure, it no longer requires information from the teacher model (or other UAV data), and can be used for surveys implemented solely with the ASV-mounted MBES system. Finally, as already stated, the MBES depth information (and any corresponding secondary features) were intentionally discarded, since otherwise the produced model could only be used for predictions in the depth range of the train dataset, which is dictated by the depth-limited UAV optics, and of course the topography of the train area.
All data preparation, classifier training, cross-validation, projections, and plotting were performed with the Python scientific stack [33,34,35], and especially the scikit-learn library [36].
2.4.1. Assembly of the Feature and Label Arrays
In the case of the teacher model, feature values were produced by interpolating the R, G, and B channels of the photogrammetry mosaic, at the locations of the 242 GT points, and the corresponding habitat types were used as labels.
For the student model, the feature value set consisted of all valid values from the BS and BS secondary feature mosaics (spatial gradient, roughness, and spatial statistics), excluding any data within a one (1) meter distance from any GT point. This operation resulted in a total of 69,080 23-element feature vectors. To train the student model, teacher model predictions were utilized. To achieve this, the XY coordinate pairs corresponding to each feature vector were used to interpolate the R, G, and B channels of the photogrammetry mosaic. The interpolated RGB values were, in turn, used as input to the teacher model to predict the corresponding labels, thus producing the label array for training the student model.
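The teacher-to-student label transfer can be sketched as follows (a minimal illustration with synthetic stand-in arrays; the real pipeline interpolates the RGB channels of the photogrammetry mosaic at each acoustic cell’s XY coordinates):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-ins for the real arrays: RGB values at the 242 GT points with their
# habitat labels, the 23-element acoustic feature vectors, and the RGB values
# interpolated at the same XY coordinates as the acoustic cells.
rgb_gt = rng.random((242, 3))
labels_gt = rng.integers(0, 3, 242)
acoustic_features = rng.random((1000, 23))
rgb_at_acoustic_xy = rng.random((1000, 3))

# 1. Train the teacher on the GT-labelled RGB values.
teacher = RandomForestClassifier(n_estimators=100, random_state=0)
teacher.fit(rgb_gt, labels_gt)

# 2. Teacher predictions at the acoustic cells become the student's labels.
pseudo_labels = teacher.predict(rgb_at_acoustic_xy)

# 3. Train the student on the acoustic features alone; afterwards it needs
#    no optical input to predict.
student = RandomForestClassifier(n_estimators=100, random_state=0)
student.fit(acoustic_features, pseudo_labels)
```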
For the direct model, feature values were produced by interpolating the BS and BS secondary feature mosaics at the GT point locations, and the corresponding habitat types were used as labels. Due to the BS layers being smaller than the final photogrammetry mosaic, some GT points were outside the BS data and were not included, leading to a total of 148 23-element vectors for the features and corresponding labels for the direct model.
In the case of the unsupervised model, all valid values from the BS and BS secondary feature mosaics (spatial gradient, roughness, and spatial statistics) were used in the analysis, leading to a total of 73,282 23-element vectors. This is the same as the feature set of the student model, without the step of excluding data in the vicinity of the GT points.
2.4.2. Train/Test Splitting
Each feature/label set was split in a 0.8/0.2 fashion to produce the corresponding training and testing datasets. As a result of this procedure, for the teacher model, 193 points were used for training and 49 for testing, while for the student model, 55,264 points were used for training and 13,816 for testing, and for the direct model, 118 points were used for training and 30 for testing. The training datasets were then used for model parameter estimation and training, and testing datasets were withheld to assess each model’s performance against the “new” data. For the student and direct models, splitting was performed in a stratified manner to ensure a similar representation of the three habitat classes in training and testing, while for the teacher model, simple shuffling was used, since the GT data that were used were nearly perfectly balanced across the three classes.
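A minimal sketch of the stratified 0.8/0.2 split, using synthetic stand-ins sized like the direct model’s dataset (the resulting 118/30 counts match those reported above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.default_rng(1).random((148, 23))   # 23-element feature vectors
y = np.repeat([0, 1, 2], [50, 50, 48])           # three habitat classes

# Stratified 0.8/0.2 split: class proportions preserved in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
```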
2.4.3. Optimization of Model Parameters
For each of the supervised models, a Random Forest (RF) classifier was fitted with the appropriate training dataset. RF is an ensemble learning method that fits a number of randomized decision trees (hence, “forest”) to perform a specific classification task and uses averaging across the different estimators to improve the predictive accuracy and control overfitting [37]. RF classifiers have been previously utilized for marine habitat mapping in a variety of studies [16,38,39,40,41]. Before fitting the RF models, standardization of the feature vectors was performed by subtracting the mean and scaling to unit variance. RF hyperparameters were optimized in each case to maximize prediction power while reducing the risk of overfitting and selection bias. This was achieved by utilizing the method of k-fold cross-validation [42], i.e., using randomized fit/validate subsets of the training dataset while performing an exhaustive search (grid search) over a specified range of the hyperparameter space to find the parameters that, on average, produced the best model. In particular, (a) [1, 2, 3, 4, 5, 7, 10] was searched for the parameter ‘min_samples_leaf’, (b) [1, 2, 3, 4] was searched for the parameter ‘max_features’, and (c) [100, 200, 300, 500] was searched for the parameter ‘n_estimators’ for five folds of the training dataset of each classifier; for the parameter names, refer to the RandomForestClassifier of the Python scikit-learn library. The best model produced by the search was then re-trained against the total training dataset and used as the final classifier.
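The cross-validated grid search can be sketched with scikit-learn as follows (synthetic stand-in data; the grid is shrunk so the example runs quickly, with the study’s full ranges noted in the comment):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.random((120, 23))    # stand-in for the 23-element feature vectors
y = rng.integers(0, 3, 120)  # stand-in habitat labels

pipe = Pipeline([
    ("scale", StandardScaler()),  # subtract mean, scale to unit variance
    ("rf", RandomForestClassifier(random_state=0)),
])

# The study searched min_samples_leaf in [1, 2, 3, 4, 5, 7, 10],
# max_features in [1, 2, 3, 4], and n_estimators in [100, 200, 300, 500];
# a reduced grid is used here for speed.
grid = {
    "rf__min_samples_leaf": [1, 5],
    "rf__max_features": [2, 4],
    "rf__n_estimators": [100],
}
search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")
search.fit(X, y)

# With refit=True (the default), the best parameter set is automatically
# re-trained on the full training data, mirroring the final step above.
best_model = search.best_estimator_
```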
For the unsupervised classifier, a Principal Component Analysis (PCA) was applied to reduce the 23 dimensions of the feature array to three (3) principal components (PCs), and then a k-means clustering algorithm was applied to the PCs to separate the data into three (3) classes. The labeled output was compared with the teacher model predictions at the same locations, and the unsupervised labels were associated with the appropriate habitat type based on the highest percentage of overlap and renamed accordingly.
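A minimal sketch of the unsupervised pipeline and the overlap-based relabeling, with synthetic stand-in arrays:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = rng.random((500, 23))                    # stand-in feature array

# 23 features -> 3 principal components, then k-means into 3 clusters.
pcs = PCA(n_components=3).fit_transform(X)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(pcs)

# Associate each cluster with the habitat type it overlaps most, based on
# teacher-model predictions at the same locations (stand-in labels here).
teacher_pred = rng.integers(0, 3, 500)
mapping = {c: int(np.bincount(teacher_pred[clusters == c]).argmax())
           for c in np.unique(clusters)}
habitat_labels = np.array([mapping[c] for c in clusters])
```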
2.4.4. Model Assessment and Projections
For the supervised classifiers’ assessment, each model’s average score and its st. dev. were calculated based on five individual folds of the specific train dataset, the score against the model-specific test dataset was computed, and the corresponding confusion matrix [43] was calculated. Accuracy, i.e., the fraction of correct predictions, was used as the score function for all RF models. The evaluation metrics used in the study include (a) the accuracy score, defined as the proportion of all correct classifications, and (b) the precision, defined as the proportion of all positive classifications that are actually positive. In particular, since this is a multi-class problem, the so-called macro-weighted precision was computed, i.e., the precision metric was calculated for each label separately in the classic binary sense and the per-class results were averaged with weights given by the number of instances of each class.
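With scikit-learn, the two metrics and the confusion matrix can be computed as follows (toy label vectors for illustration; note that scikit-learn calls this support-weighted average simply “weighted” precision):

```python
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix

y_true = [0, 0, 1, 1, 2, 2]   # toy ground-truth habitat labels
y_pred = [0, 1, 1, 1, 2, 0]   # toy model predictions

acc = accuracy_score(y_true, y_pred)                        # fraction correct
prec = precision_score(y_true, y_pred, average="weighted")  # support-weighted
cm = confusion_matrix(y_true, y_pred)                       # rows: true class
```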
In addition to the above, specifically for the teacher model, replicates of the classifier were produced for different sizes of the train dataset in the context of a sensitivity analysis with respect to the size of the GT dataset. In particular, the GT dataset was split 25 times in a stratified manner with the training set ratio ranging from 5% (24 GT points) to 90% (229 GT points), and in each case, the remaining GT data were used for testing. The average score and its st. dev. based on five folds of the train dataset were computed for each of the 25 instances of the model, and the corresponding score against the test dataset was also evaluated. For the student and unsupervised models, the total accuracy scores and confusion matrices were also calculated based on the GT data intersecting the BS layers.
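The GT-size sensitivity analysis can be sketched as a simple loop over stratified splits (synthetic stand-in data; the ratio range follows the 5–90% span described above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.random((242, 3))     # stand-in for RGB features at the GT points
y = rng.integers(0, 3, 242)  # stand-in habitat labels

# 25 stratified splits with increasing training ratio; the remainder of the
# GT set is used for testing in each instance.
ratios = np.linspace(0.05, 0.90, 25)
scores = []
for r in ratios:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=float(r), stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    scores.append(clf.fit(X_tr, y_tr).score(X_te, y_te))
```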
The final trained models were used to produce habitat maps for the study area and the total area per habitat type was calculated. For the student, direct, and unsupervised models, the degree of overlap for each class with respect to the corresponding prediction of the teacher model was also estimated.
4. Discussion
Effective habitat mapping of P. oceanica meadows can be performed by both optical (satellite or aerial) and acoustic means (most commonly single-beam echosounders, MBES, or sidescan SONAR systems). Each approach has different advantages and disadvantages, with a common issue being the limited applicability of optical methods at higher depths or in cases of reduced visibility, and the arguably more demanding sampling requirements when utilizing acoustics. The latter often results in lower quality when considering areas where both methods are applicable, which is especially true in the case of more affordable and portable MBES systems, such as those that are easier to integrate into autonomous vehicles. In order to overcome such issues, various approaches have been developed based on hybrid optical/acoustic techniques [14,15,16,17].
This work explores a new method for the habitat mapping of P. oceanica meadows, combining datasets collected by unmanned systems (UAVs and ASVs) capable of operating within the region of interest for this habitat, as well as VC performed by scientific divers. The purpose of this integrative method is to make the best use of the advantages of each methodology while producing a habitat classifier that is more effective than any single approach alone, retaining a balance between absolute performance and range of application, which are equally important in an operational mapping context. The method is based on two classifiers that are decoupled with respect to input: a teacher model based on photogrammetry, and a student model based solely on acoustics. A transfer learning scheme is used to train the student model with dense projections from the teacher model (which, in turn, is trained with GT data) and produce an acoustic classifier with far better performance than one trained directly with the GT data.
In the study area, the student model reached a total accuracy of 85.1% against the GT data, and 89% when considering P. oceanica meadows alone, a respective improvement of 24.9% and 25.7% when compared with the direct acoustic model. Beyond the initial training stage, the independence of the student model from optical data makes it applicable at any depth and, in an operational context, mapping can be supported solely by a single surveying device (the ASV). In addition, although the acoustic depth measurement was intentionally kept out of the student classifier’s feature list to produce a depth-independent model, the MBES depth information remains available and allows for 3D mapping of the entire study area and thus the bathymetric distribution of the P. oceanica meadows, information which cannot yet be reliably retrieved by optics for the entire depth range of this habitat [24,25,26].
From a different perspective, this method makes feasible the use of a compact, low-cost MBES appropriate for autonomous use to reconstruct habitat information with similar fidelity to a highly accurate camera system. It is important to note that, without the knowledge transfer from the optical dataset (teacher model), the MBES BS and BS-derived mosaics were not adequate to produce an effective classifier from the GT data alone. Furthermore, the secondary features based on the BS mosaic at different scales proved to be necessary to obtain a high-performing classifier with the student model, which is similar to findings in other studies [15,44,45].
Due to the study area topography, it was impossible to verify the method’s applicability beyond the training depth or the validity of the predictions over a wide depth range. Although, as already discussed, depth and depth-derived features are not explicitly included in the model, range, slope, and other topography-dependent effects on MBES BS measurements remain important [46,47,48,49], as they could come into play and affect the model’s performance. The model’s secondary features, based on wide spatial-scale statistics, could perhaps be more robust to such effects, especially in the case of the depth ranges of interest for P. oceanica meadows, but the method’s applicability remains to be verified.
Considering the operational applicability of the method, one aspect is the model’s transferability to other areas. It is expected that habitat types other than those found in the training dataset of this study (rock, clay, sand with a different consistency, etc.) will pose an issue for the current model instance. It would also be especially interesting to see how the method would perform in the presence of different seagrass meadows, e.g., patches of Cymodocea nodosa. Each time a new habitat type is introduced to the model, retraining with a new photogrammetry dataset will be necessary. However, it is expected that this additional training stage will be required less often as the model is expanded to incorporate more habitat types. Ultimately, a training library generated by pooling data from a wide range of habitat types is expected to produce a universal model, completely obviating the need for UAV measurements. In addition to the above, since it uses an MBES which was not calibrated for BS, the trained model is an empirical model adapted to the specific instrument used during the field survey. Using different equipment will most likely require that the training stage is repeated, while the alternative would be to perform an MBES calibration procedure to standardize the model. However, BS calibration in MBES is still a subject of active research [49,50], and depending on the task, an instrument-specific empirical model could be an acceptable compromise. Another future step for the presented method could be improving the teacher model, e.g., by using secondary features similar to the BS spatial statistics utilized in the case of the student classifier, or by applying image segmentation/feature extraction before the learning stage. Any improvement in the teacher model will likely benefit the student model as well and improve the methodology in general.
5. Conclusions
Following the appropriate adaptations, the presented method provides a final model which could be the workhorse in an operational mapping application. This approach makes it possible to utilize an ASV equipped with a portable MBES to produce results of similar quality to those of a UAV with an optical system. Moreover, by operating only with acoustic data, the model does not have the depth limitations of the UAV optics and also has the benefit of providing bathymetric information, which can be utilized for 3D habitat mapping. An ASV MBES survey designed to cover a region of interest can thus provide the necessary dataset for the model to produce a geo- and depth-referenced predicted habitat map. Finally, the method can be adapted to incorporate additional habitat types through ad hoc UAV surveys, allowing the model to integrate the new classes upon first encounter.
Considering the empirical nature of the produced model and the fact that all of the presented analysis can be automated, the integration of the algorithm into the ASV is also possible, providing opportunities for further automation, e.g., onboard habitat map generation or flagging areas with low classification probabilities.