1. Introduction
Marine environments are changing under the cumulative effects of human activities, natural dynamics, and global climate variations [1]. The coastal zones of the Mediterranean are particularly vulnerable to these impacts, which are the primary drivers of ongoing environmental change in the region. Consequently, there is an increasing emphasis on developing a comprehensive understanding of the hydrodynamics of the Mediterranean Sea [2]. This interest stems from the association of numerous water parameters with extreme weather events, such as torrential rainfall and prolonged droughts, as well as their direct environmental consequences, including coastal flooding, beach erosion, degradation of coastal vegetation and agricultural lands, and saltwater intrusion into coastal aquifers, lagoons, and other freshwater resources [3]. Moreover, climatic variability has been shown to negatively impact fisheries in the region by driving changes in phytoplankton and zooplankton populations, which form the foundation of the marine food web [4]. The frequency of these phenomena has increased in recent decades, emphasising the need to examine the hydrodynamics of the region in order to understand their effects more fully.
Numerous regions within the Mediterranean Sea exhibit notable variations in their spatial extent, complex structural patterns, and temporal dynamics. Protecting these ecosystems requires accurate monitoring of their physical characteristics and governing processes at fine spatial and temporal scales [5]. Given the anticipated global climate change, understanding how marine ecosystems evolve is imperative, and monitoring therefore plays a vital role in oceanic protection and management [6]. To evaluate physical aquatic dynamics effectively, it is essential to acquire precise and up-to-date information on several parameters, including sea surface salinity and sea surface temperature [7]. Remote sensing techniques are a valuable tool for characterising the distinctive features and mechanisms operating within these productive marine regions [8]. By supplying up-to-date data, remote sensing can therefore assume a crucial role in monitoring and surveying oceanographic mechanisms and contribute significantly to societal, cultural, and economic advances.
Satellite remote sensing is a valuable instrument for observing aquatic ecosystems. The regular overpass of satellites enables the scheduled and cost-efficient acquisition of diverse observations across extensive and inaccessible regions within brief time frames [9]. Because of the variety of sensors, methodologies, and platforms used, satellite observations differ in their spatial, spectral and temporal attributes, with distinct satellites tailored for specific applications [10]. Additionally, integrating in situ data with satellite data has created more opportunities to investigate the ever-evolving processes within hydrographic regions over extensive areas and extended periods. Nevertheless, handling intricate datasets and recognising their underlying patterns and trends is challenging. As a result, advanced techniques such as Machine Learning have been explored, which allow information to be extracted in a more efficient, precise, and automated manner [11]. Compared to traditional approaches, Machine Learning can construct models of high-dimensional, nonlinear data with complex relationships among variables. This capability proves valuable across a wide array of applications within various components of Earth systems [12].
One area where remote sensing and Machine Learning can be applied is the measurement of sea surface temperature (SST) and sea surface salinity (SSS), which are fundamental components of marine research. Both parameters are pivotal metrics for gauging climate variability, oceanic conditions, and ecological dynamics. By evaluating surface temperature and salinity, researchers gain insights into climate trends, localised weather occurrences, fluctuations in marine biodiversity, ocean circulation patterns and water quality [13,14]. Methods for monitoring SST and SSS encompass in situ measurements, satellite observations, and model-generated data. In situ approaches directly sample water temperature and salinity at specific locations. Common in situ techniques include ship-based measurements; observations from buoys, drifters, gliders and floats; and instrumented moorings that produce long-term data [15,16]. The SST and SSS measurements obtained through these in situ techniques are subsequently used to validate and evaluate the accuracy of the satellite-derived data [17]. Furthermore, by merging satellite data with the extensive records of in situ measurements, it becomes feasible to create more detailed reconstructions of observable marine patterns over a broader expanse of the Earth’s surface [18]. This integration enables stakeholders to evaluate and monitor changes in these physical parameters with both temporal precision and broad spatial coverage [19].
Although the integration of satellite data, in situ data, and Machine Learning holds the potential to revolutionise marine science, remote sensing technologies remain largely untapped within the field of environmental management [20,21,22]. The authors of [23] delineated four primary themes contributing to the underutilisation of these technologies: concerns about the cost of data products; concerns about their precision, especially for highly dynamic parameters; uncertainty regarding the continuity of satellite missions; and challenges in obtaining administrative approval for integrating remote sensing into the decision-making process. The latter theme is essential because the future of applied remote sensing hinges on maritime managers, the ultimate end-users, acknowledging the need for data capable of revealing changes in patterns across an almost uninterrupted spatial range and highlighting disturbances and features within these oceanic regions [8]. Hence, the major challenge lies in establishing remote sensing as a customary tool for evaluating changes in aquatic zones by acquainting end-users with the extensive array of existing satellite data and the available imagery that meets their requirements [24]. Achieving this goal would transform this sector from a data-poor field marked by limited spatiotemporal coverage into a data-rich field where new hydrographic data is collected regularly and with high accuracy [25]. Therefore, this study aims to integrate in situ data, satellite data, and a Machine Learning algorithm, namely random forest, to generate accurate spatial and temporal sea surface salinity (SSS) and sea surface temperature (SST) maps for the Maltese archipelago from multispectral satellite products. Additionally, this paper proposes an efficient methodology for acquiring local marine physical data, which can subsequently be applied in a wider European context, encompassing larger regions characterised by greater temporal and spatial variability.
2. Materials and Methods
2.1. Study Area
This study took place within the Maltese archipelago, which consists of three primary islands, Malta, Gozo, and Comino, spanning a combined area of 316 km² (Figure 1). Situated in the Mediterranean Sea within the Strait of Sicily, between Sicily and Tunisia, the archipelago has approximately 190 km of coastline. Its terrain is predominantly characterised by cliffs, clay slopes, and boulder formations, contributing to a diverse array of underwater topographic and ecosystem features. The seabed surrounding the Maltese Islands comprises a broad, level continental shelf situated predominantly on the eastern sides of Malta and Gozo, bordered by a continuous escarpment that parallels the coastline [26]. Conversely, on the western periphery of the islands, the shelf narrows considerably, stretching only a few hundred metres from the coast and featuring steep cliffs reaching heights of approximately 100 m [27]. Consequently, the waters to the west of the islands are generally deeper than those on the eastern side.
Located at the heart of the Mediterranean Sea, the Maltese Islands are influenced by various oceanic processes, including the thermohaline circulation, which is a distinctive water movement pattern affecting both temperature and salinity [28]. Furthermore, the Mediterranean undergoes considerable seasonal temperature fluctuations that impact ocean dynamics through variations in evaporation and precipitation rates, subsequently altering water density and circulation patterns due to changes in salinity levels [29]. Consequently, these dynamics render the Maltese archipelago an ideal location for evaluating spatial and temporal variations in SSS and SST.
2.2. In Situ Data
The SeaExplorer sea glider, obtained through the Oceanography Malta Research Group at the University of Malta (Figure 2), facilitated the acquisition of real-time, in situ measurements of SSS and SST. By changing its buoyancy to glide horizontally and vertically through the water column, it reached depths of up to 700 m, continuously gathering salinity and temperature data, which was transmitted back to the researcher upon surfacing. The sea glider was deployed on two separate missions within the Mediterranean Sea. The first mission, launched south of the Maltese Islands near Filfla, lasted 38 days, from 30 July 2021 to 6 September 2021. Its trajectory ran towards northern Tunisia before returning to Malta. The second mission commenced in the Gozo Channel, in the northern part of the Maltese Islands, and lasted 29 days, from 9 November 2021 to 8 December 2021. Its course was oriented towards southern Tunisia before returning to Malta. As this research focused exclusively on the sea surface data gathered by the sea glider, the in situ data was cleaned by excluding any measurements collected below a depth of 5 m. The missions and satellite data employed in this study are depicted in Figure 3. This data was used to generate predictions of SSS and SST for the years 2022 and 2023.
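The surface-only cleaning step described above can be sketched as follows. This is a minimal illustration rather than the processing actually applied: the column names (`depth_m`, `salinity_psu`, `temperature_k`), the inclusive treatment of the 5 m threshold, and the sample values are all assumptions made for the example.

```python
import pandas as pd

# Threshold taken from the text; treating it as inclusive is an assumption.
SURFACE_DEPTH_M = 5.0


def keep_surface_samples(profile: pd.DataFrame,
                         max_depth_m: float = SURFACE_DEPTH_M) -> pd.DataFrame:
    """Keep rows sampled at or above max_depth_m and drop incomplete rows."""
    surface = profile[profile["depth_m"] <= max_depth_m]
    surface = surface.dropna(subset=["salinity_psu", "temperature_k"])
    return surface.reset_index(drop=True)


# Made-up glider samples, purely to show the call (not real measurements).
glider = pd.DataFrame({
    "depth_m": [0.8, 3.2, 5.0, 12.4, 250.0],
    "salinity_psu": [37.7, 37.7, 37.8, 38.1, 38.7],
    "temperature_k": [300.1, 300.0, 299.8, 296.5, 287.3],
})
surface = keep_surface_samples(glider)
print(surface)
```

Only the three shallowest rows survive the filter; the same function could be reused for any profile table that follows the assumed column layout.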
For the 2024 SSS and SST measurements, in situ salinity and temperature data were acquired using an ARGO float equipped with a CTD instrument (Figure 2), deployed north of Gozo on 22 June 2024 (Website containing the information of the Argo float used: https://fleetmonitoring.euro-argo.eu/float/7901135, accessed on 15 June 2024). The float spends most of its time drifting with deep ocean currents and collects a series of measurements while moving through the water column. Upon reaching the surface, the float determines its location, typically via GPS, and then transmits its collected data via satellite. As with the sea glider, the float dataset was cleaned by excluding measurements taken below the sea surface threshold, retaining only the SSS and SST measurements. The data points collected by the ARGO float are illustrated in Figure 4.
2.3. Satellite Data
Apart from the collected in situ data, this study employs satellite imagery from the Sentinel-2 constellation, comprising two Earth observation satellites, Sentinel-2A and Sentinel-2B, launched in 2015 and 2017, respectively [30]. Both satellites follow a sun-synchronous orbit and together offer a five-day revisit time at the equator, ensuring data is acquired under consistent viewing conditions. Additionally, the data is captured at a high spatial resolution and with a broad field of view suitable for multispectral observation. Equipped with a Multispectral Instrument (MSI), each satellite captures reflected radiance across 13 spectral bands ranging from visible to shortwave infrared wavelengths (Table 1). This extensive spectral range combined with high spatial resolution enables comprehensive analysis of land cover, vegetation health, water quality, and other factors. Moreover, the MSI features a 12-bit radiometric resolution, enhancing its capacity to distinguish differences in the intensity or reflectance of the structures being mapped. The Sentinel-2 images were acquired from the ESA’s Copernicus Sentinel Scientific Data Hub, which offers comprehensive, unrestricted and openly accessible Sentinel user product data.
For the first glider mission, satellite images captured on 23 August 2021 were used, and images from 23 November 2021 were selected for the second mission. For the float mission, a satellite image captured on 25 June 2024 was used. The first two dates were chosen because they lie close to the midpoint of their respective missions, whereas the latter date was selected as it was the closest available image to the in situ measurements. These images also exhibited minimal cloud coverage and sun glint. The 13 spectral bands were extracted from each image. For the first two missions, two additional images were sourced from the Copernicus Sentinel Scientific Data Hub to serve as test sets for predicting SSS and SST for 2022 and 2023 within the designated region and timeframe: one captured on 24 October 2022 and another on 9 October 2023. The satellite image taken on 25 June 2024 was also used for the 2024 test set. Again, the 13 bands were extracted from these images, and the predictions were based on the in situ measurements obtained in 2021 for the first two years and in 2024 for the last year (Table 2).
In the analysis of SST, only the spectral bands were employed. Conversely, for the SSS analysis, SST satellite data was incorporated in the training phase, given the strong dependence of SSS readings on SST measurements. The SST satellite data corresponding to the aforementioned dates was retrieved from the Copernicus Marine Data Store (Copernicus Marine Data Store website: https://data.marine.copernicus.eu/products, accessed on 13 August 2024), specifically from the Mediterranean Sea High Resolution and Ultra High Resolution Sea Surface Temperature Analysis directory [32]. The dataset has a spatial resolution of 1 km and was acquired at processing Level 4 (L4), which employs interpolation to fill gaps caused by cloud coverage or other sources of missing data. Subsequently, the SST data was resampled to match the dimensions of the satellite spectral band data. Using SAGA GIS, the SST data was converted into a vector file, facilitating the extraction of SST data for each pixel within the spectral band raster image (Figure 5).
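The idea of matching the coarse SST grid to the finer band grid can be illustrated with a nearest-neighbour resampling in NumPy. This is a swapped-in technique for illustration only: the study performed this step in SAGA GIS via a vector conversion, and the function name, the tiny array, and the assumption that both grids cover exactly the same geographic extent are hypothetical.

```python
import numpy as np


def resample_nearest(coarse: np.ndarray, target_shape: tuple[int, int]) -> np.ndarray:
    """Nearest-neighbour resampling of a 2-D grid onto target_shape.

    Assumes the coarse and target grids span the same geographic extent.
    """
    n_rows, n_cols = target_shape
    # For every target row/column, find the coarse cell that contains it.
    rows = (np.arange(n_rows) * coarse.shape[0] / n_rows).astype(int)
    cols = (np.arange(n_cols) * coarse.shape[1] / n_cols).astype(int)
    return coarse[np.ix_(rows, cols)]


# Made-up 2 x 2 SST grid in kelvin, expanded to a 4 x 6 grid.
sst_coarse = np.array([[299.0, 300.0],
                       [301.0, 302.0]])
sst_fine = resample_nearest(sst_coarse, (4, 6))
print(sst_fine)
```

Each fine pixel simply inherits the value of the coarse cell it falls in, so no new SST information is created; the step only makes the SST layer stackable with the spectral bands pixel by pixel.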
2.4. Dimensionality Reduction
Once the training datasets for both parameters had been created, comprising the variables listed in Table 3, they were passed to a dimensionality reduction algorithm to identify the most relevant parameters for the investigation.
This was necessary because integrating irrelevant parameters may lead to complex models that are considerably harder to interpret and execute than models built with only the most essential parameters. To identify the most significant dimensions or parameters in the input space, Principal Component Analysis (PCA) was employed. PCA is a multivariate statistical technique aimed at consolidating information from multiple variables observed on the same subjects into fewer variables [33]. The PCA established that the principal dimensions essential for predicting SSS and SST were the blue, green, red, and infrared spectral bands. Additionally, for SSS predictions, SST data was significant. The remaining dimensions were excluded from the datasets as they were deemed irrelevant to the present study.
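One common way to read variable relevance from a PCA is to inspect the explained variance of each component and the loadings of the original variables on the leading components. The sketch below illustrates that reading with scikit-learn on synthetic data; the text does not state how the PCA output was interpreted in this study, so the loading-based ranking, the variable names and the data are all assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical variable names; the actual candidates are those listed in Table 3.
feature_names = ["blue", "green", "red", "nir", "swir"]

# Synthetic stand-in data: four correlated columns plus one independent column.
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 1))
correlated = shared + 0.1 * rng.normal(size=(200, 4))
independent = rng.normal(size=(200, 1))
X = np.hstack([correlated, independent])

# Standardise so that no variable dominates purely because of its units.
X_std = StandardScaler().fit_transform(X)

pca = PCA().fit(X_std)
explained = pca.explained_variance_ratio_  # variance share of each component
loadings = pca.components_                 # shape: (n_components, n_features)

# Rank the variables by the magnitude of their loading on the first component.
order = np.argsort(-np.abs(loadings[0]))
for idx in order:
    print(f"{feature_names[idx]:>5}: loading on PC1 = {loadings[0, idx]:+.2f}")
print("explained variance ratio:", np.round(explained, 3))
```

In this toy setup the four correlated columns dominate the first component while the independent column contributes little, which is the kind of pattern that would motivate keeping a subset of bands.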
2.5. Prediction Models
In this study, the random forest (RF) Machine Learning algorithm was used to predict the SST and SSS variation around the Maltese Islands. The RF is frequently used due to its high performance and is derived from the Decision Tree algorithm. Several decision trees are built by sampling the dataset and randomly choosing features for each sample. The predictions of the individual trees are then combined (averaged, in the case of regression) to improve the accuracy of the predictive model. The RF algorithm captures nonlinear interactions among variables and can handle both categorical and numerical data types. Compared with conventional models such as Linear Regression, Multilayer Perceptron and Support Vector Machines, the RF can construct more accurate and adaptable models from the available data, delivers results rapidly, is resilient to overfitting, and can manage multicollinearity.
The RF algorithm is a supervised Machine Learning approach and therefore requires labelled data for training. The learning process is divided into two stages: training and testing. During the training phase, the algorithm uses samples from the training data to learn the relationship between the features and the target, thereby constructing the model. The test set then evaluates the performance of the model on unseen data, gauging its capacity to generalise to new instances. In this study, the dataset included coordinates, spectral information, in situ data and the other relevant variables retained after the PCA. It was divided in an 80/20 ratio: the training subset comprised 80% of the original dataset and the testing subset the remaining 20%. The ‘ntree’ parameter, indicating the number of decision trees constructed by the model (‘n_estimators’ in scikit-learn), was set to 100. While constructing more trees can stabilise the predictions of the model, the gains diminish as the ensemble grows, whereas model complexity and computational burden continue to increase.
Upon construction, training, and testing of the model, the Pearson Correlation Coefficient (PCC) and the Root Mean Square Error (RMSE) were used as performance metrics to quantify the agreement between the in situ parameter values and the predicted values. The PCC ranges from −1 to 1, with the sign indicating whether the variables are positively or inversely correlated and a larger absolute value denoting a greater degree of linear co-variation between the two variables. The PCC was chosen over alternatives such as the Spearman Correlation Coefficient because it measures the strength of a linear, proportional relationship between the in situ and predicted values, whereas rank-based coefficients assess monotonic relationships in which the variables may increase or decrease at differing rates. The RMSE complements the PCC by quantifying the magnitude of the difference between predicted and in situ values, in the units of the parameter. A lower RMSE value indicates that the predictions are closer to the actual values, suggesting a better-performing model. Lastly, the outcomes of the model were plotted to provide a graphical representation of its performance.
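For reference, the two metrics can be written in their standard form. The notation is an assumption introduced here: y_i denotes the i-th in situ value, ŷ_i the corresponding prediction, bars denote sample means, and n is the number of test samples.

```latex
\[
r = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}
         {\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}\;
          \sqrt{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^{2}}},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^{2}}
\]
```

Because r is unaffected by a constant offset or a rescaling of the predictions, a high PCC alone does not guarantee small errors, which is why it is reported together with the RMSE.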
Once the RF algorithm was trained and tested for each parameter, the satellite data from October 2022, October 2023 and June 2024, covering the whole of the Maltese archipelago, was used to predict these parameters for the entire region of interest. The predictions were subsequently validated against model data derived from the Marine Copernicus Data Hub (Figure 6). Finally, the October 2022, October 2023, and June 2024 results were compared to understand the spatial and temporal variations of the two parameters occurring within the designated region.
To summarise, the inputs used to construct the RF Machine Learning model included in situ SST and SSS data, along with the red, blue, green, and infrared bands from Sentinel-2 satellites. The model was developed in Python (v.3.13) using the Pandas library. The ‘sklearn.ensemble’ module from the scikit-learn Python Machine Learning library provided the built-in random forest regressor. Once the regressor was imported, the model was fitted to the training data, enabling the algorithm to learn from the data. The ‘sklearn.model_selection’ module was used to implement the training-to-testing split as the validation method. Predictions were then made on the testing data and compared with the actual in situ values, and the performance of the model was assessed using the ‘sklearn.metrics’ module to obtain the PCC and RMSE performance metrics. Lastly, the outputs of the model included the predicted SST and SSS, along with the corresponding longitude and latitude of the predicted points. The complete predictive modelling process is depicted in Figure 7.
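The workflow summarised above can be sketched in a few lines of scikit-learn. This is an illustrative outline under stated assumptions, not the code used in the study: the data is synthetic, the column names and the `random_state` values are hypothetical, and the PCC is computed here with `numpy.corrcoef`, since the text does not specify the exact function call used.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training table (400 rows, as for 2022 and 2023).
rng = np.random.default_rng(42)
n_rows = 400
data = pd.DataFrame({
    "blue": rng.uniform(0.02, 0.10, n_rows),
    "green": rng.uniform(0.02, 0.10, n_rows),
    "red": rng.uniform(0.01, 0.08, n_rows),
    "nir": rng.uniform(0.00, 0.05, n_rows),
})
# Made-up relationship between reflectance and SST (kelvin), plus noise.
data["sst_k"] = (295.0 + 40.0 * data["blue"] + 20.0 * data["nir"]
                 + rng.normal(0.0, 0.1, n_rows))

features = ["blue", "green", "red", "nir"]

# 80/20 training-to-testing split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["sst_k"], test_size=0.2, random_state=42
)

# 100 trees, matching the 'ntree' setting reported above.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Performance metrics on the held-out 20%.
pcc = float(np.corrcoef(y_test.to_numpy(), y_pred)[0, 1])
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
print(f"PCC = {pcc:.3f}, RMSE = {rmse:.3f} K")
```

For SSS, the same outline would apply with a salinity target and the co-located satellite SST added as a further feature column; predicting a full map then amounts to calling `model.predict` on the band values of every pixel and attaching each pixel's longitude and latitude.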
3. Results and Discussion
3.1. Random Forest Performance
Datasets from October 2022, October 2023, and June 2024 were utilised to train and test the RF algorithm for both SSS and SST. The first two datasets included in situ SSS and SST data gathered between August and December 2021, while the June 2024 dataset included in situ data collected in June 2024 using a float. Moreover, these datasets incorporated four spectral bands extracted from satellite data for the years 2022, 2023, and 2024, respectively. These were utilised to assess the predictive accuracy of the RF algorithm for parameters that are more dynamic than bathymetry. The PCC and RMSE were employed to evaluate the relationship between the in situ SSS and SST data and the corresponding predictions made by the RF algorithm. The results are summarised in Table 4. Furthermore, correlation plots were generated between the in situ data and the predicted sea surface parameters to visualise these relationships, as shown in Figure 8.
The RF algorithm attained very high correlation coefficients and low RMSE values for SSS in 2022 and 2023, demonstrating a strong linear relationship between the predicted SSS values and the in situ measurements. This is further validated by the SSS scatter plots, which show a tight clustering of points around the diagonal line, indicating a close relationship between the predicted and actual values. The minimal differences between the two years demonstrate that the performance of the model remains highly consistent over time. Conversely, in 2024, there is a noticeable decline in the correlation strength compared to previous years. This reduction can likely be attributed to the substantially smaller training dataset available for this year, comprising only 40 data points, compared to the 400 data points in the datasets for 2022 and 2023. Despite this, the predicted SSS values remain closely aligned with the observed in situ SSS values. This stability suggests that the RF algorithm reliably captures the underlying patterns and variability in SSS despite potential interannual changes in the data. Thus, the RF algorithm effectively models the factors influencing salinity, including freshwater inputs, evaporation, and ocean currents.
Similarly, the SST correlation coefficients for 2022 and 2023 are high and positive, indicating a strong and consistent correlation, whilst the coefficient for 2024 is significantly lower due to the same factors affecting the 2024 SSS dataset. The consistency between 2022 and 2023 further suggests that the performance of the model is stable and capable of generalising well across different temporal datasets. Despite the high correlation, the lower coefficients and higher RMSE values for SST compared to SSS suggest that predicting SST is inherently more difficult. This difficulty likely arises from the complex interactions of factors affecting temperature, including atmospheric conditions, ocean-atmosphere heat exchange, and regional climatic variability. Additionally, the greater dispersion of data points seen in the SST scatter plots in Figure 8 further indicates that SST data exhibits higher variability and more noise compared to SSS.
The higher correlation coefficients and closely clustered scatter plots for SSS relative to SST indicate that the RF algorithm more accurately predicts salinity than temperature. This difference in performance may be due to the characteristics of the data and the underlying physical processes. SSS tends to display more stable and predictable patterns, while SST is more susceptible to short-term and small-scale fluctuations. This rationale can similarly be applied when considering parameters such as bathymetry in relation to the two sea surface parameters. Bathymetry tends to exhibit more stable variations, as it is subjected to less change compared to the more variable salinity and temperature parameters. Therefore, the stability in variation of a parameter is associated with the effectiveness of the RF algorithm in learning the underlying patterns within the datasets.
Moreover, comparing the scatter plots and PCCs from 2022 and 2023 to those of 2024 clearly demonstrates that the RF algorithm requires a substantial amount of data to effectively learn the underlying patterns and relationships within the dataset. This need for sufficient data is evidenced by the relatively lower PCCs observed in 2024 compared to the preceding years. The strength of the RF algorithm lies in the diversity and independence of its individual trees. A larger number of data points ensures that the trees are sufficiently diverse from one another, thereby enhancing the overall performance of the ensemble and providing more reliable statistical estimates. As the dataset size increases, the RF is able to identify a broader range of patterns and relationships, improving the robustness and accuracy of the ensemble model. In contrast, smaller datasets may lack the diversity required for the trees to learn effectively, resulting in reduced prediction performance. Furthermore, a larger dataset enables the RF to discern more generalised patterns rather than fitting noise or irrelevant details, and to better capture the subtle interactions and dependencies between features. In comparison, smaller datasets may fail to represent the full complexity of these relationships. Nevertheless, the strong performance for both parameters demonstrates the robustness and adaptability of the RF algorithm in handling various types of oceanographic data.
3.2. SSS and SST Maps
To evaluate the spatial and temporal fluctuations in SSS and SST within the Maltese Islands, the RF algorithm generated three separate maps for each parameter: one for October 2022, another for October 2023, and a third for June 2024. These maps show the extent of SSS and SST variations surrounding the Maltese Islands (Figure 9).
In 2022, the SSS values range from 37.69 to 37.79 PSU, exhibiting a noticeable gradient. The southern and southwestern areas of the region show comparatively lower salinity values than the northern and northeastern regions. Conversely, in 2023, the SSS values range from 37.55 to 38.55 PSU. There is a general increase in salinity throughout the region, with the western region displaying a higher saline concentration than the eastern region. In 2024, the SSS values range from 37.20 to 38.20 PSU, indicating a reduction in salinity compared to the preceding year. Nonetheless, the pattern observed in 2023 persists in 2024, with the western regions exhibiting the highest salinity levels. These fluctuations in SSS from 2022 to 2024 indicate a potential climatic shift or alteration in oceanographic conditions, potentially attributed to factors such as changes in precipitation, evaporation rates, or current patterns.
Regarding SST, the values in 2022 vary from 299.5 K to 302.5 K, with the highest temperatures observed in the southeastern section of the region. Conversely, the SST values in 2023 range from 295.0 K to 302.0 K, whilst those in 2024 range from 297.0 K to 299.0 K. These values display a decreasing trend over the three years, indicating a cooling pattern in the SST, which could be due to a potential alteration in climatic conditions or oceanic currents. The spatial variability in SST also diminishes from 2022 to 2024. This trend may be attributed to increasing environmental stability within the region. Alternatively, as previously mentioned, it may also result from using a smaller dataset for the RF model in 2024, which could reduce its ability to accurately capture data variability, leading to less precise map predictions.
A key finding from these maps is that higher variability in SST corresponds to higher variability in SSS, indicating the dependence of SSS on SST. The observed inverse correlation, where an increase in SSS coincides with a decrease in SST, may imply climatic or oceanographic phenomena such as the intrusion of cooler, saltier waters or seasonal fluctuations in freshwater input and evaporation [34]. In the context of the Maltese Islands, various specific factors contribute to the observed inverse correlation between SSS and SST. The Mediterranean climate is characterised by warm, arid summers and rainy winters [35]. In summer, elevated temperatures increase evaporation, whilst the impact on salinity variability is mitigated by other factors such as precipitation events. Moreover, the inflow of different water masses into the Mediterranean Sea, such as Atlantic waters entering via the Strait of Gibraltar, impacts local salinity and temperature patterns. Cooler Atlantic waters exhibit lower salinity levels in comparison to warmer, more saline Mediterranean waters [36]. While the relationship between SST and SSS is highly significant in a marine context, it is not the primary focus of this study. Consequently, this correlation is not explored in greater detail, as the main objective is to assess the accuracy of the RF model in predicting these parameters.
3.3. RF Predictive Accuracy for SSS and SST
Figure 10 depicts the predictive accuracy of the RF algorithm for SSS and SST data in 2022, 2023, and 2024. Each graph contrasts the predicted SSS or SST values with unseen model data sourced from the Marine Copernicus Data Hub, providing a visual evaluation of the accuracy and generalisation capabilities of the algorithm.
The scatter plot representing SSS data for 2022 exhibits points that are moderately scattered around the line of best fit, indicating some variability in the predictions. Nonetheless, the predicted SSS values for 2022 closely resemble the model data, implying that the RF algorithm generated accurate predictions, albeit with slight variations. In contrast, the scatter plots for SSS in 2023 and 2024 exhibit a tighter clustering of points around the line of best fit, indicating that the RF model demonstrated superior predictive performance compared to 2022.
Regarding SST, both 2022 and 2023 exhibit a comparable pattern, demonstrating a robust correlation between predicted and model values. The data points are closely grouped around the line of best fit, indicating a high level of predictive precision. Conversely, the predictive accuracy for SST in 2024 demonstrates a weaker positive correlation, indicating reduced accuracy, which may once again be attributed to the size of the 2024 dataset. Nonetheless, the predicted SST values for 2024 remain highly comparable to the model SST values despite showing some minor variations. Hence, the results reaffirm that the RF model serves as a dependable tool for SST prediction, given its consistently high accuracy in 2022 and 2023.
While the scatter plots in Figure 10 demonstrate positive correlations between the predicted and model data, employing in situ data could have yielded more reliable and precise analysis compared to using model-generated data. This is primarily because in situ measurements represent the actual conditions for ground truthing and offer a direct reference to evaluate the accuracy of predictive models. Moreover, in situ measurements can offer real-time data at a finer resolution, capturing temporal variations more accurately compared to model data, which may rely on periodic sampling or time-averaging. However, due to the absence or insufficient acquisition of in situ data for SSS and SST in 2022, 2023, and 2024, the utilisation of model data was necessary, serving as a satisfactory substitute for in situ data.
In this study, an integrated approach was utilised, and it proved to be successful and insightful. However, several limitations were encountered, which may have impacted the overall direction of the study or the accuracy of the results. These limitations stem from challenges inherent in the use of satellite and in situ data, the complexity of accurately modelling marine environments, and the constraints associated with the Machine Learning technique employed. The following summarises the key limitations observed during the study.
The spatial resolution was limited to 10 m, the finest resolution offered by the Sentinel-2 datasets. Therefore, certain fine-scale variations within the marine environment might have been overlooked. Additionally, in situ data was not obtained for SSS and SST validation due to time and cost constraints, so model data, which is less accurate than ground truth data, was required to validate the results. Lastly, the in situ dataset used for SSS and SST in 2022 and 2023 contained measurements from a sea glider survey collected in 2021, whilst the in situ dataset for 2024 might have been too small, which could have led to errors within the RF algorithm during training and testing. As a result, this data might not have been sufficient for the algorithm to learn the underlying patterns within the dataset. Therefore, future studies could benefit from the acquisition of a larger dataset.
These limitations highlight the challenges faced in accurately predicting SSS and SST using a combination of satellite data, in situ observations, and MLAs. Recognising these limitations is crucial for contextualising the findings, understanding the potential sources of error, and guiding future research to enhance predictive capabilities and data integration methodologies.
4. Conclusions
Remote sensing and Machine Learning present many opportunities to improve our understanding of marine ecosystems. Therefore, this study was performed to explore the potential of using such data and techniques to better understand the physical processes occurring around the Maltese Islands. Initially, a set of aims and objectives was defined, and this research sought to meet them. The main aim was to integrate in situ data, satellite data, and MLAs to generate accurate SSS and SST predictions for the Maltese archipelago. Thus, an empirical workflow was implemented by utilising the Sentinel-2 satellite platforms, the RF Machine Learning technique, and in situ data obtained through sea gliders and ARGO floats. Subsequently, numerical data produced by the RF Machine Learning method was transformed into visual data in the form of correlation graphs and SSS and SST maps of the Maltese Islands to give a better understanding of the spatial and temporal changes occurring within this region.
The main findings from this study show that integrating satellite-derived data, in situ data, and MLAs was effective in producing accurate SSS and SST predictions, illustrating the potential of utilising all three techniques in tandem. Additionally, the RF performance for both these parameters shows that the algorithm is capable of handling dynamic parameters, producing highly accurate predicted parametric maps of SSS and SST for three separate years, 2022, 2023, and 2024. This gave a good understanding not only of the spatial but also of the temporal changes in these parameters around the Maltese Islands. Lastly, the SSS and SST results indicated that the RF algorithm requires a substantial amount of in situ data to effectively learn the underlying patterns and relationships within the dataset, as evidenced by the reduced PCCs of SSS and SST results in 2024 in comparison to the 2022 and 2023 PCCs.
Following comprehensive data collection and rigorous analysis, the study provided valuable insights into the use of satellite data and MLAs to predict salinity and temperature. These are crucial parameters for marine monitoring and are essential for numerous applications, including marine conservation, resource management, climate change preparedness, water quality management, and erosion and sediment management.
This study showed that satellite data can be utilised to predict highly dynamic marine parameters. However, as our reliance on satellite-based monitoring increases, this research has also illustrated that traditional monitoring techniques based on in situ instruments should not be neglected. Most Machine Learning applications that utilise satellite imagery depend on ground-based data for training and validation. Consequently, ongoing in situ surveys, particularly those at regularly monitored locations, are crucial to assess the growing range of remote sensing platforms, techniques and data as a means to improve the overall scientific understanding of these methodologies.
Additionally, the methodology employed in this study, despite being applied within a relatively small area surrounding the Maltese Islands, demonstrates significant potential for implementation in larger and more extensive regions. This adaptability stems from the reliance of the approach on the integration of in situ data and satellite data rather than on characteristics specific to the study location. However, a key limitation of this methodology lies in its dependence on the quality of the satellite data. Atmospheric and oceanographic conditions, such as cloud coverage, sun glint, and turbid waters, can compromise the accuracy of the RF model by introducing outliers into the satellite dataset. These outliers arise because the affected pixels differ spectrally from typical open-water pixels. To mitigate this, it is crucial to use satellite images that are free from such conditions and provide a clear, unobstructed view of the sea surface. Similarly, the quality of in situ data is pivotal, as the RF algorithm relies on this data for its predictions. Employing precise instruments, such as drifters, sea gliders, and floats, to accurately measure sea surface parameters is essential for extending this methodology to broader study areas and reducing errors within the algorithm.
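The outlier concern raised above for cloud, sun glint, and turbid water can be illustrated with a simple screening step applied to pixel values before they are passed to a model. The study does not describe such a step, so the following is only a sketch under stated assumptions: it uses a modified z-score based on the median absolute deviation (MAD), a generic robust rule rather than a dedicated cloud or glint mask, and the function name `screen_outliers`, the threshold of 3.5, and the reflectance values are all hypothetical.

```python
import statistics


def screen_outliers(values, threshold=3.5):
    """Return the indices of values kept by a modified z-score (MAD) rule.

    Hypothetical helper for illustration only; not part of the study's workflow.
    """
    values = list(values)
    med = statistics.median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # No measurable spread, so nothing can be flagged as unusual.
        return list(range(len(values)))
    kept = []
    for i, v in enumerate(values):
        # 0.6745 rescales the MAD so the score is comparable to a z-score
        # for normally distributed data.
        modified_z = 0.6745 * (v - med) / mad
        if abs(modified_z) <= threshold:
            kept.append(i)
    return kept


# Made-up reflectances for one band; the last pixel mimics cloud or glint
# contamination (assumption, not data from the study).
reflectance = [0.021, 0.023, 0.022, 0.020, 0.024, 0.310]

kept = screen_outliers(reflectance)
print(kept)
```

A rule of this kind only removes values that are statistically unusual within the sample. It would not replace the selection of clear scenes, and it could discard genuine extremes, so any threshold would need to be checked against the imagery.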
By addressing the limitations and leveraging new technologies and methodologies, future research can enhance the accuracy, resolution, and applicability of these predictive models. The following recommendations outline key areas for future work, offering a pathway to build upon the current findings and push the boundaries of what can be achieved in marine environmental prediction and monitoring.
Firstly, drone data offers untapped potential for predicting marine parameters at a higher resolution, which in some cases can reach the centimetre scale. This would provide more detail than satellite data and give a better understanding of the small-scale changes occurring within our marine regions. Secondly, gathering a large SSS and SST in situ dataset can help further validate the results of this study, as this would allow the MLAs to learn from a larger number of data points and better capture the underlying patterns within these coastal parameters, and should therefore provide a more accurate and reliable model to predict SSS and SST. Thirdly, acquiring additional ground-truth data can mitigate the issue of relying on less accurate model data, thereby providing a better understanding of the accuracy levels of the algorithm. Lastly, the methodology utilised in this study can also be applied to other dynamic marine parameters, such as chlorophyll, to determine whether this approach is suitable for them and, if so, what level of accuracy the predictions can reach.