Article

Comparison of Machine Learning Approaches for Reconstructing Sea Subsurface Salinity Using Synthetic Data

1 College of Meteorology and Oceanography, National University of Defense Technology, Changsha 410073, China
2 Institute of Atmospheric Physics, Chinese Academy of Sciences, Beijing 100029, China
3 National Climate Center, Chinese Meteorological Administration, Beijing 100081, China
4 Eco-Environmental Monitoring and Research Center, Pearl River Valley and South China Sea Ecology and Environment Administration, Ministry of Ecology and Environment of the People’s Republic of China, Guangzhou 510611, China
* Author to whom correspondence should be addressed.
Remote Sens. 2022, 14(22), 5650; https://doi.org/10.3390/rs14225650
Submission received: 23 September 2022 / Revised: 27 October 2022 / Accepted: 5 November 2022 / Published: 9 November 2022
(This article belongs to the Section Ocean Remote Sensing)

Abstract

There is growing interest in using sparse in situ salinity data to reconstruct high-resolution, three-dimensional subsurface salinity with global coverage. However, in areas without observations, no observed data are available for comparison with the reconstructed fields, which makes it difficult to assess the quality and improve the accuracy of the reconstructed data. To address these issues, this study adopted the ‘resampling test’ method to build ‘synthetic data’ for testing the performance of different machine learning algorithms. Data from the high-resolution version of the Centre National de Recherches Meteorologiques Climate Model Version 6 (CNRM-CM6-1-HR) were used. The key advantage of the CNRM-CM6-1-HR data is that the true salinity values are known across the entire ocean at every point in time, so the reconstruction results can be compared against them. The ‘synthetic dataset’ was established by resampling the model data according to the locations of in situ observations. This synthetic dataset was then used to prepare two datasets: an ‘original synthetic dataset’ with no noise added to the resampled truth values and a ‘noised synthetic dataset’ with an observation-error perturbation added to the resampled truth values. The resampled salinity values of the model were taken as the ‘truth values’, and the feed-forward neural network (FFNN) and light gradient boosting machine (LightGBM) approaches were used to design four reconstruction experiments and build multiple sets of reconstructed data. Finally, the advantages and disadvantages of the different reconstruction schemes were compared through a multi-dimensional evaluation of the reconstructed data, and the applicability of the FFNN and LightGBM approaches for reconstructing global salinity data from sparse data was discussed. The results showed that the best-performing scheme has low root-mean-square errors (~0.035 psu) and high correlation coefficients (~0.866). 
The reconstructed dataset from this experiment accurately reflected the geographical pattern and vertical structure of salinity fields, and also performed well on the noised synthetic dataset. This reconstruction scheme has good generalizability and robustness, which indicates its potential as a solution for reconstructing high-resolution subsurface salinity data with global coverage in practical applications.

1. Introduction

Complete and reliable gridded ocean data are fundamental to climate change and oceanographic research [1,2,3,4,5,6]. However, the limitations of observation methods and the high observation costs mean that in situ salinity observations are sparse, which severely limits in-depth investigation of ocean salinity variability and the underlying mechanisms [7,8,9]. Therefore, marine scientific research must use sparse in situ observations to reconstruct high-quality, long-term time-series and high-resolution gridded salinity data with global ocean coverage to ensure the accuracy of new data in large-scale climate change and marine environmental change studies. However, achieving such reconstruction is a challenge and a major conundrum in ocean data research [10,11,12,13,14].
Most previous studies on the reconstruction of ocean salinity data have focused on observed areas using, for example, the DOM method [4], the EN method [15], the LEV method [16], the Ishii method [17], and Institute of Atmospheric Physics (IAP) gridding method proposed by Cheng et al. [10,18]. These methods perform quality assessments and validation of reconstructed data based on areas with observed data. However, in no-observation areas, the reconstructed field lacks the real salinity observation data for comparison, which leads to difficulties in performing a quality assessment of the reconstructed data. Thus, it is essential to conduct comparative evaluation research on reconstruction data in no-observation areas. Meanwhile, the reliability and applicability of the reconstruction method in no-observation areas require further research. In addition, it is difficult to obtain accurate large-scale features in reconstructed data from sparse in situ observations, and information from satellite remote sensing plays a more important role in the upper ocean than in the deep ocean because surface information is more relevant to the upper ocean than to the deep ocean. Therefore, using only surface remote sensing and in situ observation data can decrease the accuracy of the reconstructed data for the deep ocean [8]. Thus, it is important to determine an approach for introducing appropriate large-scale salinity background information in the reconstruction and improving the reconstruction performance of salinity at varying depths. It will also result in an effective improvement of conventional reconstruction methods.
In recent years, researchers have started using machine-learning (ML) techniques to reconstruct ocean data [19,20,21,22,23,24]. ML is a data-driven technology that learns and models patterns in historical data and predicts desired outputs, which is particularly advantageous in processing noisy and incomplete data [25,26]. Previous studies have reported that ML approaches are well suited to capturing the complex, nonlinear relationships between sea surface information and in situ observations. For example, Chen et al. [27] reconstructed the pCO2 in the Gulf of Mexico using the multi-linear regression (MLR), multi-nonlinear regression, principal component regression, decision tree, support vector machine (SVM), multilayer perceptron neural network, and random forest (RF)-based regression ensemble (RFRE) methods, and performed extensive validation, evaluation, and sensitivity analyses. The results showed that the RFRE model exhibited the highest robustness in estimating the surface pCO2 in most Gulf of Mexico waters. Wang et al. [28] compared four ML approaches [extreme gradient boosting (XGBoost), MLR, RF, and the neural network (NN)] in terms of their performance in estimating subsurface temperatures in the western Pacific Ocean using gridded monthly Argo data as truth values. They found that the NN approach outperformed the other three ML approaches. Furthermore, Lu et al. [29] estimated subsurface temperatures based on a clustering-NN method using gridded Argo data as truth values, and found that this approach outperformed the clustering-linear regression and RF methods. Most previous data reconstructions have used continuous gridded data as truth values for model training. However, it is unclear whether these ML methods remain equally applicable when they are trained on, and reconstruct from, sparse, noisy in situ observation data. 
Therefore, it is necessary to investigate the applicability of typical ML algorithms for reconstructing global coverage ocean salinity data from sparse in situ observations.
This study focuses on investigating and improving the deficiencies of current research on the reconstruction of global ocean salinity, which involves the following three key scientific issues: (1) How to assess the quality of reconstructed data in no-observation areas. (2) How to improve the accuracy of reconstruction for deep-sea regions while preserving large-scale features under the conditions that observations are sparse and satellite observations do not reflect the deep-sea salinity distribution. (3) How to select an ML approach that is suitable for reconstructing global high-resolution subsurface salinity data from sparse field observations.
To solve these problems, we used data from the Centre National de Recherches Meteorologiques—Climate Model Version 6 and its high-resolution counterpart (CNRM-CM6-1-HR) [30] along with the ‘resampling test’ method to evaluate the gridded products and reconstruction methods on the basis that the model results have global coverage and that the true values of salinity are known. First, we resampled the CNRM-CM6-1-HR model data using the locations of in situ observations, considering the resampled data of the model as synthetic data, meaning the reconstructed result can be compared with the real value of the synthetic data. Such an approach helps overcome the difficulty associated with no-salinity observations for comparison with reconstructed fields. Second, we re-gridded the salinity data of the CNRM-CM6-1-HR model to 1° × 1°, and these data were used as the background field for data reconstruction. The addition of background fields not only preserves the large-scale salinity characteristics of the reconstructed data but also improves the reconstruction accuracy of the deep-ocean area. Third, based on the synthetic data, the global-scale salinity was reconstructed by designing different experimental schemes using feed-forward neural network (FFNN) and light gradient boosting machine (LightGBM) approaches. The reconstructed salinity data provide global coverage of the oceans at a horizontal resolution of 0.25° × 0.25° in 41 vertical levels from a depth of 1 m to 2000 m and a monthly temporal resolution from 1993 to 2018. A multidimensional evaluation of the reconstructed data was performed to analyse the strengths and weaknesses of the FFNN and LightGBM approaches for reconstructing the global coverage of ocean salinity data from sparse in situ observations. This process helps in the selection of a reasonable ML method for data reconstruction.
The rest of the paper is organised as follows. The data and methods used in the study are described in Section 2. The performances of the different experimental schemes are assessed and presented in Section 3. The study’s conclusions are summarized in Section 4.

2. Data and Methods

2.1. Data

2.1.1. Coupled Model Intercomparison Project Phase 6 (CMIP6) Model Data

This study was based on data from the CNRM-CM6-1-HR model (0.25° × 0.25°) in the High-Resolution Model Intercomparison Project of CMIP6 (https://esgf-node.llnl.gov/search/cmip6/, accessed on 5 September 2021). The CNRM-CM6-1-HR model belongs to the National Centre for Meteorological Research, Météo-France, and CNRS laboratory [30]. The key advantage of the CNRM-CM6-1-HR is that the true values for salinity are known, and thus it was possible to compare the reconstruction result to these true value data. Remote-sensing observations of surface ocean absolute dynamic topography (ADT), sea surface temperature (SST), and sea surface wind field (SSW) are important factors that are significantly correlated with subsurface salinity. Therefore, we used the monthly average data of SST, SSW, ADT, and salinity data with a high resolution of 0.25° × 0.25° provided by the CNRM-CM6-1-HR model for two scenarios of HIST-1950 and HighRes-Future. The HIST-1950 data were used for the period between 1993 and 2014, while the HighRes-Future data were used between 2015 and 2018.

2.1.2. Synthetic Data

Resampling

The CNRM-CM6-1-HR model data were resampled according to the time and location of in situ observations to mimic real-world observations for constructing the ‘synthetic dataset’. The sea surface salinity (SSS), SST, SSW and ADT obtained by resampling from the CNRM-CM6-1-HR model are called ‘synthetic satellite remote sensing’ information. Resampled salinity data were taken as true values. The in situ ocean salinity observations used in this study were obtained from the World Ocean Database [31] (https://www.ncei.noaa.gov/products/world-ocean-database, accessed on 10 October 2021), including data obtained from all available instruments (i.e., Argo, Bottle, and conductivity–temperature–depth).
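The resampling step can be illustrated with a minimal sketch. It assumes a nearest-neighbour lookup on a regular latitude/longitude grid and handles only the horizontal dimensions; the function names, the toy grid, and the toy observation points are hypothetical, and the study's actual matching in time and depth is not reproduced here.

```python
from bisect import bisect_left


def nearest_index(axis, value):
    """Return the index of the grid coordinate closest to `value`.

    `axis` must be sorted in ascending order.
    """
    pos = bisect_left(axis, value)
    if pos == 0:
        return 0
    if pos == len(axis):
        return len(axis) - 1
    # Pick whichever neighbour is closer to the requested value.
    before, after = axis[pos - 1], axis[pos]
    return pos if (after - value) < (value - before) else pos - 1


def resample_at_observations(field, lats, lons, obs_points):
    """Sample a gridded field at observation locations (nearest neighbour).

    field[i][j] is the model value at (lats[i], lons[j]);
    obs_points is a list of (lat, lon) pairs.
    """
    samples = []
    for lat, lon in obs_points:
        i = nearest_index(lats, lat)
        j = nearest_index(lons, lon)
        samples.append(field[i][j])
    return samples


# Toy 0.25-degree grid: 3 latitudes x 4 longitudes of salinity (psu).
lats = [10.0, 10.25, 10.5]
lons = [120.0, 120.25, 120.5, 120.75]
field = [
    [34.0, 34.1, 34.2, 34.3],
    [34.4, 34.5, 34.6, 34.7],
    [34.8, 34.9, 35.0, 35.1],
]
# Hypothetical observation positions (lat, lon).
obs = [(10.1, 120.3), (10.49, 120.74)]
print(resample_at_observations(field, lats, lons, obs))  # [34.1, 35.1]
```

The values returned for the observation points play the role of the resampled 'truth values'; grid cells that no observation falls into are simply never sampled, which is what makes the synthetic dataset as sparse as the real one.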

Noising of the Truth Value

The CNRM-CM6-1-HR model data are ‘perfect’. However, real-world observations contain inherent noise, which stems from instrumental error and representativeness error [32,33]. Therefore, the truth value of the resampled synthetic dataset was artificially noised to better approximate real-world observations and thus analyse the effect of noise on the reconstruction. Instrumental error is a major source of random error in observations. Random errors satisfy a normal distribution with a zero mean and fixed variance. The error associated with the representativeness can be obtained from the standard deviation of IAP1° salinity data (http://www.ocean.iap.ac.cn/, accessed on 20 September 2021) from the IAP. The truth values of the synthetic dataset were artificially noised as follows:
$$PN_{subx} = P_{subx} + 3 \times \mathrm{IAP1\_std} \times \varepsilon$$
where $PN_{subx}$, $P_{subx}$, $\mathrm{IAP1\_std}$, and $\varepsilon$ represent the noised truth value, the original non-noised truth value, the standard deviation of the IAP1° salinity data in 1990–2015, and a random number drawn from the standard normal distribution $N(0, 1)$, respectively.
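As a minimal sketch of the noising step, the snippet below perturbs each truth value by 3 × IAP1_std × ε, with ε drawn from N(0, 1). The function name, the toy values, and the fixed seed are assumptions made for illustration only.

```python
import random


def add_observation_noise(truth, iap1_std, rng, factor=3.0):
    """Perturb each truth value: noised = truth + factor * IAP1_std * eps."""
    noised = []
    for p, std in zip(truth, iap1_std):
        eps = rng.gauss(0.0, 1.0)  # eps ~ N(0, 1)
        noised.append(p + factor * std * eps)
    return noised


rng = random.Random(0)  # fixed seed so the sketch is reproducible
truth = [34.5, 35.0, 34.8]    # toy resampled salinity values (psu)
iap1_std = [0.05, 0.02, 0.0]  # toy standard deviations at the same points
noised = add_observation_noise(truth, iap1_std, rng)
print(noised)
```

Where the standard deviation is zero, the value is left unchanged, so the perturbation is largest in regions where the IAP1° salinity varies most.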
Figure 1 illustrates the resampling of the CNRM-CM6-1-HR data for January 2016 (Figure 1a) based on the spatial and temporal location of in situ observations (Figure 1b). Figure 1c shows the results of the resampling, and Figure 1d shows the spatial distribution of the resampled, noised salinity. January 2016 was chosen arbitrarily for illustration purposes; the same effect can be achieved with other times.
In this study, the datasets of the original and noised truth values are referred to as the ‘original synthetic dataset’ and the ‘noised synthetic dataset’, respectively.

Equivalence between the CNRM-CM6-1-HR and IAP1° Salinity Data

The coverage of salinity observation data is relatively sparse for real-world observations, and therefore does not reflect well the large-scale characteristics of the global salinity distribution. In addition, satellite sea surface observations do not closely reflect the deep-ocean salinity distributions [8,34]. Thus, we propose a data-reconstruction approach that can retain large-scale characteristics while improving the reconstruction accuracy for deep-ocean areas. Specifically, coarse-resolution data (for example, IAP1° data [10,18]) were added to the model input as the background field for reconstruction, and high-resolution remote-sensing sea surface observations were merged onto the background field. In this way, the reconstructed data not only inherit the large-scale information in the coarse-resolution data but also retain the small-scale information in satellite observations to maximise the accuracy of high-resolution (0.25° × 0.25°) data. In the present study, this merging process requires the salinity data output from CNRM-CM6-1-HR model to be equivalent to the IAP1° salinity data. The IAP1° data provide a global coverage of the oceans, at 1° × 1° horizontal resolution on 41 vertical levels from 1 m to 2000 m, and at a monthly resolution from 1940 to present. This product combines in situ salinity profiles with coupled model simulations from phase 5 of the Coupled Model Intercomparison Project to derive an objective analysis with the Ensemble Optimal Interpolation approach [10,18,35].
To ensure that the IAP1° and CNRM-CM6-1-HR salinity data are equivalent, a wavenumber spectrum analysis, which can reveal mesoscale and small-scale fluctuations, was performed on the two datasets. Their equivalence in physical background and spatial scale was assessed by how closely they match in wavenumber space, and the best-matching treatment scheme was then selected. The Kuroshio region has rich mesoscale variability and contains information at scales ranging from small-scale to mesoscale. Therefore, we compared the wavenumber spectra of the two datasets in the Kuroshio region to select the equivalent treatment scheme. Figure 2 shows the wavenumber spectra in the Kuroshio region at 5 m. In Figure 2, IAP1, Model-1A, Model-sm, and Model-raw denote the wavenumber spectra calculated from (i) the IAP1° data directly; (ii) the CNRM-CM6-1-HR model data smoothed using a 2.25° × 1.25° grid and then interpolated to a 1° × 1° grid; (iii) the CNRM-CM6-1-HR model data smoothed using a 2.25° × 1.25° grid without interpolation to a 1° × 1° grid; and (iv) the raw CNRM-CM6-1-HR model data (0.25° × 0.25°). The spectral lines of Model-1A and IAP1 nearly coincide. Therefore, the Model-1A treatment was applied to the 0.25° × 0.25° salinity output of the CNRM-CM6-1-HR model to obtain the IAP1°-equivalent salinity (equ-IAP1).
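The smooth-then-regrid treatment can be sketched as follows. This is only an illustration under stated assumptions: the smoothing is taken to be a moving average whose window spans 2.25° in longitude (9 cells of 0.25°) and 1.25° in latitude (5 cells), the interpolation to 1° is replaced by simple subsampling of every fourth grid point, and land masks and longitude wrap-around are ignored. None of these choices is confirmed by the study.

```python
def box_smooth(field, half_rows, half_cols):
    """Moving-average smoothing with a (2*half_rows+1) x (2*half_cols+1) window.

    Windows are truncated at the edges of the array.
    """
    n_rows, n_cols = len(field), len(field[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            total, count = 0.0, 0
            for a in range(max(0, i - half_rows), min(n_rows, i + half_rows + 1)):
                for b in range(max(0, j - half_cols), min(n_cols, j + half_cols + 1)):
                    total += field[a][b]
                    count += 1
            out[i][j] = total / count
    return out


def subsample(field, step):
    """Keep every `step`-th grid point in both directions."""
    return [row[::step] for row in field[::step]]


# Toy 0.25-degree field: 8 latitude rows x 12 longitude columns.
fine = [[float(i + j) for j in range(12)] for i in range(8)]
# 5 rows = 1.25 degrees of latitude, 9 columns = 2.25 degrees of longitude.
smoothed = box_smooth(fine, half_rows=2, half_cols=4)
coarse = subsample(smoothed, step=4)  # 0.25 degrees -> 1 degree
print(len(coarse), len(coarse[0]))  # 2 3
```

Smoothing before coarsening removes the small-scale variance that a 1° product cannot represent, which is the property the wavenumber spectra are used to check.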

2.2. Method

2.2.1. FFNN

An FFNN is a unidirectional, multilayer network structure that comprises one input layer, hidden layers, and one output layer [36,37,38,39,40,41]. Layer 0 is referred to as the input layer, the last layer is referred to as the output layer, and the intermediate layers are referred to as the hidden layers. In an FFNN, neurons are arranged according to layers, and each neuron belongs to a specific layer. Furthermore, each neuron is connected to neurons in the previous and next layers. The information in an FFNN flows unidirectionally from the input layer to the hidden layers and then to the output layer, i.e., the output from the previous layer is input into the next layer, and the information in the next layer does not affect the previous layer. Each layer in the FFNN is equivalent to a function; the FFNN consists of interconnected multiple layers equivalent to a compound function, which reflects the linear or nonlinear mapping relationship between the input and output variables. The complexity of an FFNN depends on the number of neurons in each layer and the number of hidden layers; the larger the number of layers, the more complex the FFNN.
In the present study, an FFNN was implemented using Keras with a TensorFlow 2.0 backend [42]. The dataset was randomly divided into a training set and a test set at a ratio of 4:1. The training and test sets were used for model training and performance evaluation, respectively; the two sets were completely independent of each other. The root-mean-square error (RMSE) was used as the measure of model performance; training was stopped and the optimal parameter settings were saved when the RMSE of a model started to increase. The main hyperparameters considered for the FFNN included the number of hidden layers, the number of neurons in each layer, the learning rate, the activation function, loss-function regularization, and dropout. These hyperparameters cannot be learned from the data and must be configured by the user. The FFNN structure was optimised using a grid search strategy [43]. A good reconstruction performance was obtained using the following settings: one input layer; one output layer; four hidden layers consisting of 256, 128, 64, and 32 neurons; the rectified linear unit (ReLU) as the activation function; and a learning rate of 0.001. The models were trained using Bayesian regularization and dropout [44], which helped maintain their generalizability and prevent overfitting. No validation sets were defined because the Bayesian regularization algorithm does not need cross-validation to ensure generalizability [28,29].
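To make the layer-by-layer composition concrete, the following sketch builds a network with four hidden layers of 256, 128, 64, and 32 neurons and runs a single forward pass in plain Python. It is not the Keras implementation used in the study: the weights are random and untrained, the input size of nine features is an assumption, and training, regularization, and dropout are omitted.

```python
import random


def relu(values):
    """Rectified linear unit applied element-wise."""
    return [v if v > 0.0 else 0.0 for v in values]


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) with small random weights and zero biases."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, weights, biases):
    """Affine map: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def build_ffnn(n_inputs, hidden_sizes, rng):
    """Create the hidden layers plus a single-neuron output layer."""
    sizes = [n_inputs] + list(hidden_sizes) + [1]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def forward(network, inputs):
    """Compose the layers: ReLU after each hidden layer, linear output."""
    activations = inputs
    for weights, biases in network[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = network[-1]
    return dense(activations, weights, biases)[0]


rng = random.Random(42)
# n_inputs=9 is an assumed feature count, chosen only for illustration.
network = build_ffnn(n_inputs=9, hidden_sizes=(256, 128, 64, 32), rng=rng)
x = [0.1] * 9  # one standardised input sample (toy values)
prediction = forward(network, x)
print(prediction)
```

Information flows strictly from the input to the output: each call to `dense` consumes only the activations of the previous layer, which is the unidirectional structure described above.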

2.2.2. LightGBM

LightGBM is a boosting-based ensemble learning algorithm and a highly efficient implementation of the gradient boosting decision tree (GBDT) algorithm. It maps the relationship between the input and output by partitioning the input features into different regions [22,45]. At each iteration, it uses the negative gradient of the loss function to approximate the residual of the current model and fits a new decision tree to this approximation; the existing trees remain unchanged and the new tree is added to the model, so that the predictions approach the truth values step by step. Furthermore, LightGBM uses a histogram algorithm, a depth-constrained leaf-wise growth strategy, gradient-based one-side sampling, and exclusive feature bundling to improve the accuracy and training speed of the algorithm; in addition, the algorithm can automatically screen for effective data features. Compared with traditional algorithms such as GBDT, RF, and SVM, LightGBM has been reported to yield higher accuracy and a faster prediction speed [46]. The LightGBM algorithm was implemented using LGBMRegressor, the scikit-learn-compatible interface of the LightGBM Python package. The training and test sets used for the FFNN (with the dataset randomly divided at a 4:1 ratio) were applied directly to LightGBM to ensure that both methods were trained and tested on the same data. The performance of the LightGBM model was measured using the RMSE. Three major hyperparameters were considered: (1) num_leaves, which sets the maximum number of leaves of a decision tree; it is the main parameter controlling the complexity of the tree model and is limited to prevent overfitting; (2) learning_rate, which controls the shrinkage applied to each new tree; and (3) n_estimators, which controls the number of boosting rounds to be performed. As LightGBM uses decision trees as the learners, n_estimators can also be considered the ‘number of trees’.
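The residual-fitting idea behind gradient boosting can be sketched in a few lines. The example below is deliberately not LightGBM: it uses one-split regression stumps on a single feature and squared-error loss, and it omits histograms, leaf-wise growth, gradient-based one-side sampling, and exclusive feature bundling. The function names and toy data are hypothetical; only the roles of `learning_rate` and `n_estimators` mirror the hyperparameters described above.

```python
def fit_stump(xs, residuals):
    """Find the single-threshold split that minimises squared error.

    Requires at least two distinct values in `xs`.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    total = base
    for threshold, left_mean, right_mean in stumps:
        total += learning_rate * (left_mean if x <= threshold else right_mean)
    return total


def fit_boosting(xs, ys, n_estimators=50, learning_rate=0.1):
    """Add one stump per round; earlier stumps are never modified."""
    base = sum(ys) / len(ys)
    model = (base, learning_rate, [])
    for _ in range(n_estimators):
        # For squared-error loss the negative gradient equals the residual.
        residuals = [y - predict(model, x) for x, y in zip(xs, ys)]
        model[2].append(fit_stump(xs, residuals))
    return model


# Toy data: a step from 1.0 to 3.0 at x = 5.
xs = [float(i) for i in range(10)]
ys = [1.0 if x < 5 else 3.0 for x in xs]
model = fit_boosting(xs, ys)
print(predict(model, 2.0), predict(model, 7.0))
```

With a learning rate of 0.1 each round removes about a tenth of the remaining residual, which is why a small learning rate generally needs a large number of boosting rounds.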
In the present study, the GridSearchCV function from the Scikit-Learn Python module was used to determine the optimal combination of hyperparameters. GridSearchCV uses the concepts of grid search and cross-validation to determine the optimal hyperparameters. The GridSearchCV function forms all possible combinations based on the specified range of hyperparameters, and it then cross-validates them to select the optimal combination of hyperparameters based on the validation score. The hyperparameters used in this study were num_leaves = 50, learning_rate = 0.02, and n_estimators = 10,000.
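The combination of grid search and cross-validation can be sketched without any library. The snippet below is a simplified stand-in for the idea, not scikit-learn's GridSearchCV: it enumerates every combination in a parameter grid, averages a validation error over contiguous folds, and keeps the combination with the lowest error. The toy model, its two hyperparameters (`shrink`, `offset`), and the candidate values are hypothetical.

```python
from itertools import product


def k_fold_splits(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs for contiguous, non-shuffled folds."""
    fold_size = n_samples // n_folds
    for f in range(n_folds):
        start = f * fold_size
        stop = n_samples if f == n_folds - 1 else start + fold_size
        valid = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, valid


def grid_search(param_grid, evaluate, n_samples, n_folds=5):
    """Return (best_params, best_score); lower scores are better.

    `evaluate(params, train_idx, valid_idx)` is a user-supplied callable
    that trains on train_idx and returns an error on valid_idx.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        scores = [evaluate(params, tr, va)
                  for tr, va in k_fold_splits(n_samples, n_folds)]
        mean_score = sum(scores) / len(scores)
        if mean_score < best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


ys = [2.0, 2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9, 2.0]  # toy targets


def evaluate(params, train_idx, valid_idx):
    """Toy model: predict shrink * mean(training targets) + offset; return RMSE."""
    train_mean = sum(ys[i] for i in train_idx) / len(train_idx)
    pred = params["shrink"] * train_mean + params["offset"]
    return (sum((ys[i] - pred) ** 2 for i in valid_idx) / len(valid_idx)) ** 0.5


grid = {"shrink": [0.5, 0.75, 1.0], "offset": [-0.5, 0.0]}
best_params, best_score = grid_search(grid, evaluate, len(ys))
print(best_params, best_score)
```

The cost grows with the product of the candidate counts times the number of folds, so grids over num_leaves, learning_rate, and n_estimators are usually kept coarse.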

2.2.3. Design of the Data Reconstruction Experiments

Important factors significantly correlated with subsurface salinity were selected as input for model training [47,48,49]: spatial and temporal information (time, longitude, latitude, depth); equivalent IAP1 salinity; sea surface ADT; SST; and SSW. All inputs of the model training were converted to anomalies by subtracting their respective climatologies during 1993–2015 [49]; for example, equivalent IAP1° salinity anomaly (equ-IAP1SA), ADT anomaly (ADTA), SST anomaly (SSTA), and SSW anomaly (SSWA), which include zonal and meridional anomalies (USSWA/VSSWA). Data standardization can improve the speed of convergence of the FFNN [39]. Latitudes, longitudes, and time were subjected to sine/cosine transforms using methods proposed by Denvil-Sommer et al. [50]. The ADTA, SSTA, USSWA, VSSWA, equ-IAP1SA, and standard layer depths were treated using the Z-score standardization method. During the training of the FFNN and LightGBM models, the gridded salinity anomalies were considered to be the ‘truth values.’ The input X can be expressed as
X = (longitude, latitude, time, depth, equ-IAP1SA, ADTA, SSTA, USSWA, VSSWA)
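The two kinds of preprocessing can be sketched as follows. The exact transforms of Denvil-Sommer et al. [50] are not reproduced here; the snippet assumes a generic sine/cosine encoding of a periodic quantity and a Z-score based on the population standard deviation, and the function names and sample values are hypothetical.

```python
import math
from statistics import mean, pstdev


def cyclic_encode(value, period):
    """Map a periodic quantity onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def zscore(values):
    """Standardise to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Longitudes just either side of the 0/360 boundary map to nearby points.
print(cyclic_encode(359.75, 360.0))
print(cyclic_encode(0.25, 360.0))

# Month of the year (0 = January) with a 12-month period.
print(cyclic_encode(0, 12))

# Toy SST anomalies standardised before being fed to the model.
print(zscore([-0.4, 0.1, 0.3, 0.0]))
```

The sine/cosine pair removes the artificial jump at the date line and between December and January, while the Z-score puts inputs with different units on a comparable scale.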
Four experiments were performed by combining the two ML algorithms (FFNN and LightGBM) with two vertical modelling schemes (non-layered, and layered with separate models for the 41 standard layers). The experiments are referred to in this paper as Case NN (FFNN, non-layered), Case LG (LightGBM, non-layered), Case NNL (FFNN, layered), and Case LGL (LightGBM, layered). Each experiment was trained on both the original synthetic and the noised synthetic datasets, which yielded eight reconstruction models and, ultimately, eight reconstructed datasets. The details of the experiments are summarised in Table 1.

3. Reconstruction Results

3.1. Reconstruction of Geographical Pattern

Eight datasets of the global salinity gridded product were reconstructed at a horizontal resolution of 0.25° × 0.25°, over 41 vertical levels from 1 m to 2000 m, and with a monthly resolution from 1993 to 2018, using the schemes presented in Table 1. The spatial distribution of the reconstructed data at three typical depths (10, 100, and 800 m) in January 2016 and the density distributions of the reconstructed and truth values were analysed to evaluate the quality of the reconstructed data (Figure 3, Figure 4 and Figure 5). The spatial distributions indicate that the large-spatial-scale distributions of the reconstructed salinity fields at the three typical layers of the eight reconstructed datasets are consistent with those of the truth values. They show no abrupt excursion, which indicates that the reconstruction models are robust and effective. However, the small-scale signals of the reconstructed fields are weaker than those of the truth values, which can be attributed to the fact that the equivalent IAP1° data smooth out some small-scale information during reconstruction.
On the basis of the RMSEs and Pearson correlation coefficients (CCs) at the three typical layers, Case LG slightly underperformed in the reconstruction at 800 m depth; the CCs of the reconstructions at the three typical layers yielded by other experiments were greater than 0.85; and the RMSEs were lower than 0.15 psu. This indicates that the merging of the small-scale information in satellite sea surface data onto a large-scale background field yielded reconstructed data with good prediction accuracy.
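For reference, the two skill metrics can be computed as in the sketch below, which uses the standard definitions of the RMSE and the Pearson correlation coefficient. The function names and the toy values are assumptions; the study's per-layer aggregation is not reproduced.

```python
import math


def rmse(pred, truth):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


def pearson_cc(pred, truth):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(truth)
    mean_p, mean_t = sum(pred) / n, sum(truth) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, truth))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    return cov / math.sqrt(var_p * var_t)


# Toy salinity anomalies (psu) at five grid points.
truth = [0.10, -0.05, 0.20, 0.00, -0.15]
recon = [0.08, -0.02, 0.15, 0.01, -0.10]
print(rmse(recon, truth), pearson_cc(recon, truth))
```

The two metrics are complementary: the RMSE measures the magnitude of the error in psu, whereas the correlation coefficient measures how well the spatial pattern is reproduced regardless of amplitude.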
In the density scatter distributions, the completely overlapping diagonal lines represent complete reconstruction, i.e., the reconstructions are completely consistent with the truth values. Any error in the reconstruction can lead to a more dispersed density distribution. The LightGBM-based Case LGL yielded data points that were more concentrated near the diagonal line, and, therefore, it outperformed the FFNN-based Case NN and Case NNL. The addition of noise into the truth values resulted in more dispersed distributions of data points and the degradation of the corresponding RMSEs and CCs. At 10 m, the CC of Case NN decreased from 0.888 to 0.833, and the RMSE increased from 0.106 psu to 0.133 psu. At 100 m, the RMSE of Case LGL increased from 0.052 psu to 0.061 psu, and the CC decreased from 0.929 to 0.905, which indicates a degradation in the reconstruction performance.

3.2. Reconstruction of Vertical Structure

In addition to evaluating the spatial distribution of the reconstructed data, it is necessary to analyse the vertical structure of the reconstructed salinity fields to understand the extent to which the different experiments can reconstruct it. The vertical structures along 35.375°N in the Gulf Stream and the Kuroshio extension, both regions with active mesoscale eddies, were analysed for January 2016.
Figure 6 illustrates the distribution of ADTA, sea surface salinity anomalies (SSSA), and SSTA, together with the vertical sections along 35.375°N of the resampled salinity anomalies, the equivalent IAP1° subsurface salinity anomalies, the reconstructed data, and the truth values in the Gulf Stream region in January 2016. The equivalent large-scale IAP1° salinity in Figure 6c shows a smoother distribution than the truth values, which indicates the loss of many small-scale signals. However, Figure 6e–l confirms that the overall distribution patterns of the reconstructed salinity fields are consistent with those of the truth values. Furthermore, the small-scale signal is more abundant than that in Figure 6c, which suggests that small-scale variations in satellite data and small-scale information along the section revealed by the resampled in situ observations are well reproduced. For example, in Figure 6b, a positive salinity anomaly in the upper 800 m in the 46°–47°W region exhibited by the resampled observation data is well reproduced in the reconstructed data but not in the equivalent IAP1° data. This suggests that merging small-scale remote-sensing signals onto coarse-resolution data can help reflect small-scale characteristics accurately by absorbing remote-sensing information. The ADTA and SSTA are negative at approximately 60°W, and this negative signal is reproduced in the reconstructed data. In the 40°–58°W region, the remote-sensing data showed pronounced fluctuations, and the reconstructed fields exhibited corresponding zigzag fluctuations, which suggests that the reconstructed data absorbed the ADT and SST information well. Positive salinity anomalies extend over a shallower depth range in the reconstructed data than in the truth values. For example, in the 70°–75°W region, the truth values exhibited positive salinity anomalies at a depth of 1–1000 m, whereas the reconstructed data exhibited positive salinity anomalies only at a depth of 1–900 m. 
In the 55°–58°W region, the reconstructed fields exhibited positive salinity anomalies at a depth of 1–600 m, whereas the truth values exhibited positive anomalies at a depth of 1–800 m. This was attributed to the physical phenomena in the ocean being more difficult to capture using sea surface data with increasing depth, resulting in a weak correlation between the satellite sea surface observations and deep-ocean salinity.
Figure 7 shows the vertical distribution of multiple parameters in the Kuroshio and its extension along 35.375°N. Figure 7a shows that the synthetic satellite SSSA data (which are independent data that do not participate in model training) exhibit extensive positive salinity anomalies, and the reconstructed data also exhibit widespread positive salinity anomalies near the sea surface, which indicates that the near-surface signals are reconstructed well. As with the reconstructed fields for the Gulf Stream region shown in Figure 6, the reconstructed data absorb the small-scale signals of the remote-sensing data and the large-scale information of the equivalent IAP1° data. The small-scale information is reproduced along the vertical section of the reconstructed fields. For example, the reconstructed fields in the 150°–170°E region exhibited zigzag distribution patterns of positive and negative salinity anomalies, which are an evident characteristic of small-scale signals.
The salinity reconstructed by Case NNL exhibits a pronounced, discontinuous, layered vertical structure for the original synthetic dataset (Figure 6g and Figure 7g). For the noised synthetic dataset, the salinity reconstructed by Case LGL also exhibits a discontinuous, layered vertical structure (Figure 6l and Figure 7l). This indicates that Case NNL and Case LGL do not adequately reflect the vertical structure of salinity fields. Furthermore, the discontinuous, layered vertical structure is more pronounced in the 1–200 m depth range because this range is divided into 17 standard layers with smaller intervals between the layers. In the pre-Argo period, the discontinuous, layered vertical structure was more pronounced because of fewer in situ observations. Figure 8 shows the vertical structure in the Gulf Stream region in January 1998 when in situ observations were not available in the Gulf Stream region. The salinity reconstructed by Case NNL and Case LGL exhibits vertical discontinuities in the entire Gulf Stream region and the 50–70°W region, respectively (Figure 8a,b). The addition of noise to the truth values made these vertical discontinuities even more pronounced (Figure 8f,g). In contrast, the data reconstructed by Case NN and Case LG do not exhibit such phenomena. This can be attributed to Case NNL and Case LGL dividing the vertical profile into 41 standard layers for modelling and the ML algorithms inadequately reconstructing the inter-layer connections during layered modelling, which results in inadequate learning of the physical relationship between adjacent layers in the reconstruction models for individual layers. Su et al. [45,48,51,52] adopted a layered modelling approach to reconstruct subsurface temperature and salinity; their reconstructed data do not exhibit such pronounced vertical discontinuities. 
This result can be explained as follows: (1) they used temperature (salinity) instead of the STA (SSA) for the analysis of the vertical structure, and because the full temperature (salinity) values are considerably larger than the anomalies, subtle variations in temperature (salinity) are masked in the colour map; and (2) they used gridded continuous Argo observations as truth values, whereas sparse resampled observations were used as truth values in this study.
In summary, a comparison of the vertical sections reconstructed by the different experiments shows that the salinity fields reconstructed by Case NNL and Case LGL contain discontinuities in the vertical structure. Furthermore, the addition of noise to the truth values increased these interlayer discontinuities. This further indicates that Case NNL and Case LGL do not adequately reflect the vertical structure of the salinity fields. Thus, layered modelling was not superior in the reconstruction of the global subsurface salinity from sparse in situ observations, and the problem of vertical discontinuities with layered modelling needs to be solved by considering the physical relationship between adjacent layers or by optimizing the ML approach.

3.3. Overall Reconstruction Performance

The RMSEs (Figure 9) and CCs (Figure 10) between the reconstructed data of the different experiments and the salinity truth values of the CNRM-CM6-1-HR model were calculated to quantify the overall reconstruction performance of the different experiments. Figure 9a shows that Case LGL has the smallest RMSE for all 41 standard layers when reconstruction experiments are performed with the original synthetic dataset; the reconstructed data are the closest to the truth values. Case NNL has large RMSEs at most layers and shows the poorest reconstruction performance. The RMSEs of Case NN and Case LG are close for each layer. In the 1–100 m depth range, the RMSE of Case LG is smaller than that of Case NN. In the 100–2000 m depth range (especially 800–2000 m), the RMSE of Case NN is smaller than that of Case LG. Thus, Case LG shows better reconstruction performance at the surface layers; however, the superiority of Case NN becomes more apparent as the depth increases. This finding is consistent with that of Stamell et al. [24]. Figure 9b shows that Case LG has the smallest RMSEs in the 1–100 m depth range when reconstruction experiments are performed with the noised synthetic dataset, whereas Case LGL shows the poorest performance. However, in the 100–2000 m depth range, the superiority of Case LGL is more apparent, whereas Case LG exhibits strong fluctuations in the 800–1800 m depth range, and the RMSE increases to 0.048 psu. Figure 9c–f shows that the addition of noise to the truth values increased the RMSEs of the four experiments at all layers. Case LG exhibited the largest RMSE degradations in the 800–1800 m depth range, whereas the other three experiments exhibited the largest RMSE degradations in the surface layers, which indicates that the noise in the truth values was the major source of the error in the reconstructed data and had a larger effect on the surface layers.
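The per-layer statistics discussed above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the function names (`rmse`, `pearson_cc`, `layer_metrics`) and the sample values are hypothetical, and the sketch assumes that each depth layer is reduced to a flat list of paired reconstructed and truth values, which the text does not specify.

```python
import math

def rmse(recon, truth):
    """Root-mean-square error between two equal-length sequences."""
    n = len(truth)
    return math.sqrt(sum((r - t) ** 2 for r, t in zip(recon, truth)) / n)

def pearson_cc(recon, truth):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(truth)
    mean_r = sum(recon) / n
    mean_t = sum(truth) / n
    cov = sum((r - mean_r) * (t - mean_t) for r, t in zip(recon, truth))
    var_r = sum((r - mean_r) ** 2 for r in recon)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    return cov / math.sqrt(var_r * var_t)

def layer_metrics(recon_by_layer, truth_by_layer):
    """Return one (rmse, cc) tuple per depth layer."""
    return [
        (rmse(r, t), pearson_cc(r, t))
        for r, t in zip(recon_by_layer, truth_by_layer)
    ]

# Hypothetical salinity anomalies (psu): two depth layers, four points each.
truth = [[0.10, -0.05, 0.20, 0.00], [0.02, 0.01, -0.03, 0.04]]
recon = [[0.12, -0.02, 0.15, 0.01], [0.02, 0.01, -0.03, 0.04]]
metrics = layer_metrics(recon, truth)
for layer, (layer_rmse, layer_cc) in enumerate(metrics):
    print(f"layer {layer}: RMSE={layer_rmse:.4f} psu, CC={layer_cc:.3f}")
```

Averaging the per-layer values over all layers would give a depth-mean RMSE and CC; whether the averages in Figures 9g and 10g are weighted by layer thickness is not stated here, so no weighting is assumed.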
Figure 9g shows that Case LGL has the smallest average RMSE in the 1–2000 m depth range, followed by Case NN, Case LG, and Case NNL. The addition of noise to the truth values led to different degrees of RMSE degradation in the four experiments. Figure 9h shows the RMSE degradation rates of the four experiments from the original synthetic dataset to the noised synthetic dataset. Here, the degradation rate is defined as the difference between the RMSE obtained with the noised synthetic dataset and that obtained with the original synthetic dataset, divided by the RMSE obtained with the original synthetic dataset. Case LGL, Case LG, Case NN, and Case NNL have noise-induced RMSE degradation rates of 15.5%, 12.2%, 12.0%, and 6.7%, respectively. In summary, the addition of noise to the truth values had a more significant impact on the reconstruction accuracy of the LightGBM than on that of the FFNN. The addition of noise resulted in the largest performance degradation in Case LGL.
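The degradation rate defined above can be written as a one-line function. The sketch below is illustrative only: the function names and input values are hypothetical, and the CC variant assumes that degradation means a relative decrease in CC, which the text does not define explicitly.

```python
def rmse_degradation_rate(rmse_original, rmse_noised):
    """Relative RMSE increase from the original to the noised dataset."""
    return (rmse_noised - rmse_original) / rmse_original

def cc_degradation_rate(cc_original, cc_noised):
    """Relative CC decrease (assumed definition, mirroring the RMSE one)."""
    return (cc_original - cc_noised) / cc_original

# Hypothetical values chosen for illustration, not taken from the paper.
rmse_rate = rmse_degradation_rate(0.040, 0.046)
cc_rate = cc_degradation_rate(0.90, 0.855)
print(f"RMSE degradation: {rmse_rate:.1%}, CC degradation: {cc_rate:.1%}")
```

Because the rate is normalized by the original-dataset value, an experiment with a small baseline RMSE can show a large degradation rate and still have the smaller absolute error afterwards.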
Figure 10 shows the CCs for the four experiments. Figure 10a,b shows that Case LGL has the best reconstruction performance, followed by Case NNL, irrespective of whether the reconstruction experiments are performed with the original or the noised synthetic dataset. Case LG and Case NN performed extremely well in the 1–800 m depth range but underperformed in the depth range of 800–2000 m; the CCs of Case LG even dropped to ~0.3. In Figure 10c–f, the CCs of the four experiments decreased at almost all depth layers when the reconstruction experiments were performed based on the noised synthetic dataset. The CCs of Case NN and Case LGL at the surface layers decrease sharply by 0.14 and 0.30, respectively. Case LG exhibited significant decreases at the surface layers and in the 800–1800 m depth range, with the minimum value reaching ~0.1. Figure 10g shows the average CCs in the 1–2000 m depth range for the different experiments. Case LGL has the largest average CC in the 1–2000 m depth range, followed by Case NN, Case NNL, and Case LG. The addition of noise to the truth values led to different degrees of CC degradation for the four experiments, with the noise-induced CC degradation rates of Case LG, Case LGL, Case NN, and Case NNL being 6.4%, 4.3%, 3.5%, and 2.1%, respectively (Figure 10h). In summary, the addition of noise to the truth values had a more significant impact on the reconstruction accuracy of the LightGBM than on that of the FFNN. The addition of noise caused the largest CC degradation in Case LG.
Figure 11 shows the distribution of the average RMSEs and CCs over the 1–2000 m depth range and the RMSEs and CCs at the three typical layers of the four experiments. Table 2 lists the values of the average RMSEs and CCs in the 1–2000 m depth range. Figure 11a and Table 2 suggest that Case LGL always shows the best performance, followed by Case NN irrespective of whether the experiments are performed with the original or the noised synthetic dataset. Furthermore, Case NNL has the largest RMSEs, whereas Case LG has the smallest CC. As shown by the RMSEs and CCs at the three typical layers in Figure 11b–d, Case LGL shows the best reconstruction performance at all three typical layers when performed with the original synthetic dataset; however, it has the smallest CC at a depth of 10 m when performed with the noised synthetic dataset. The noise-induced RMSE and CC degradation rates listed in Table 2 indicate that the LightGBM method-based Case LG and Case LGL have larger noise-induced RMSE and CC degradation rates compared with those of the FFNN method-based Case NN and Case NNL.
The LightGBM method-based Case LGL exhibited greater noise-induced RMSE and CC degradation rates. However, this does not indicate a better overall performance for the other three experiments. Despite the significant performance degradations, Case LGL still has a relatively small RMSE and good CC when performed with the noised synthetic dataset (Figure 9g and Figure 10g). This is because Case LGL exhibits good enough performance when carried out with the original synthetic dataset to guarantee that it can outperform other experiments with the noised synthetic dataset even if the noise-induced degradation is larger. However, Case LGL will no longer be superior if the noise is increased further because of the further noise-induced performance degradations, as indicated in Figure 12. With an increase in noise (5 × IAP1_std × ε noise added to the true value), Case LG and Case LGL show noise-induced RMSE degradation rates of 18.1% and 31.7% (Figure 12b) and CC degradation rates of 9.69% and 8.46% (Figure 12d), respectively. Case LGL on a large-noise synthetic dataset has an RMSE of 0.042 psu and a CC of 0.84, and it shows a poorer reconstruction performance than that of Case NN (Figure 12a,c). Thus, the FFNN approach exhibits greater generalizability and is more suitable for large-noise datasets compared with the LightGBM approach [53].
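The noise perturbation used for the large-noise test can be sketched as below. This is a hedged illustration rather than the authors' procedure: the function name `add_observation_noise` and the sample values are hypothetical, IAP1_std is assumed to be a per-point standard deviation, and ε is assumed to be a standard normal random number, which this section does not state.

```python
import random

def add_observation_noise(truth, iap1_std, factor, rng=None):
    """Return truth + factor * IAP1_std * epsilon for each point.

    epsilon is drawn from a standard normal distribution (an assumption).
    """
    rng = rng if rng is not None else random.Random()
    return [
        t + factor * s * rng.gauss(0.0, 1.0)
        for t, s in zip(truth, iap1_std)
    ]

# Hypothetical truth salinities (psu) and per-point standard deviations.
truth = [35.10, 34.80, 34.95]
iap1_std = [0.05, 0.03, 0.04]

# The same seed is reused so that only the scaling factor differs.
noised = add_observation_noise(truth, iap1_std, 1.0, random.Random(0))
large_noise = add_observation_noise(truth, iap1_std, 5.0, random.Random(0))
print(noised)
print(large_noise)
```

With `factor=5.0`, the perturbation corresponds in form to the 5 × IAP1_std × ε case discussed above; the factor is exposed as a parameter so that other noise levels can be tested in the same way.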
Figure 13 shows the time series of the global average RMSEs for the 1–2000 m depth range and the RMSEs at the three layers of the four experiments performed with the original (left panels) and the noised (right panels) synthetic datasets. The RMSE time series of the four experiments did not show pronounced interannual variation; however, the addition of noise to the truth values resulted in an increase in the average RMSEs in the 1–2000 m depth range and the RMSEs at the three layers. Case NNL shows the largest RMSE, irrespective of whether it was performed with the original or noised synthetic dataset. Case LGL has the smallest average RMSE in the 1–2000 m depth range when performed with the original synthetic dataset; however, it exhibits no obvious superiority compared with Case LG on the noised synthetic dataset. Based on the RMSE time series of the three typical layers, the superiority of Case LGL increased with the depth, especially when performed with the noised synthetic dataset. The RMSE at the surface layer (10 m depth) of Case LGL was only smaller than that of Case NNL. However, at 800 m, the RMSE of Case LGL is already the best among the four experiments.
Figure 14 shows the time series of the global average CCs in the 1–2000 m depth range and at the three layers of the four experiments performed with the original (left panels) and the noised (right panels) synthetic datasets. The 1–2000 m average CCs of Case NN, Case NNL, and Case LGL showed an upward trend over time, attributable to the increase in the in situ salinity observations in recent years, which has further improved the accuracy of the reconstructed data [53,54,55,56,57]. In contrast, the CCs of Case LG fluctuated significantly after 2005, which indicates the instability of the reconstruction. Although this phenomenon cannot be attributed to a specific issue, it is possibly related to the ML method itself. The addition of noise to the truth values resulted in a decrease in the global average CCs in the 1–2000 m depth range and at the three layers. The average CCs of Case LGL were greater than 0.85 when performed with the original synthetic dataset, whereas they were less than 0.85 when performed with the noised synthetic dataset. Despite the considerable noise-induced degradation, Case LGL still outperformed the other three experiments in terms of CCs when performed on the noised synthetic dataset. Case NN and Case NNL showed similar performance levels, whereas Case LG exhibited pronounced fluctuations, which occurred after 2005 at depths above 800 m. These patterns of CC variation are consistent with the CC vertical profiles shown in Figure 10.
Figure 15 and Figure 16 show the global spatial distributions of the average RMSEs and CCs in the 1–2000 m depth range of the four experiments. The spatial distributions of the RMSEs and CCs indicate differences between the four experiments in their capacity to accurately reconstruct subsurface salinity in different regions. Figure 15 indicates that the RMSE is lowest in the open ocean, particularly in the equatorial Pacific, and larger in the western boundary currents, such as the Kuroshio and the Gulf Stream, and the Antarctic Circumpolar Current (ACC), with strong mesoscale variation in these regions [58,59,60]. In addition to this, the RMSEs are large in the Arctic Ocean and along the coast of the Southern Ocean possibly because of the sparsity of in situ observations in these regions [7,8,9]. Case LGL showed the largest low-RMSE area in the equatorial Pacific Ocean and a significantly smaller high-RMSE area in the ACC region compared with the three other experiments when the reconstruction experiments were performed with the original synthetic dataset. The low-RMSE area of Case NN was only smaller than that of Case LGL. In contrast, with the noised synthetic dataset, the low-RMSE area of all experiments reduced significantly.
Figure 16 shows that the CCs are highest in the Pacific Ocean and the North Atlantic. There are large low-CC areas along the eastern coast of the Pacific Ocean and the coast of the Southern Ocean because the salinity variability in these regions is strongly affected by river runoff, surface fluxes, and ocean currents [36,61,62]. Among the four experiments, the CCs of Case LGL are significantly better than those of the other experiments in the open ocean. However, the addition of noise to the truth values resulted in a decrease in the high-CC area.
Thus, the vertical distributions of the two statistical metrics (RMSE and CC) indicate that Case LGL has the best performance and Case NNL the worst performance when using the original synthetic dataset. LightGBM-based Case LG and Case LGL exhibit a large degradation with the noised synthetic dataset, which indicates poor robustness. In contrast, Case NN shows stable performance irrespective of whether it is performed with the original or the noised synthetic dataset [53]. The trends of the two statistical metrics suggest that the larger number of in situ observations available after 2005 improves the RMSE and CC. On the basis of the spatial distributions of the two statistical metrics, the four experiments show the best reconstruction performance in the open ocean and the poorest performance in the western boundary currents (such as the Kuroshio and the Gulf Stream), the ACC, and other near-shore areas. In these areas, salinity changes are strongly affected by river runoff, surface fluxes, and ocean currents [61,63,64].

3.4. Evaluation of Spatial Patterns of Long-Term Salinity Changes

The thermohaline circulation spans the globe, transporting mass and energy between low and high latitudes, and has a positive effect on global climate change; thus, it is important to evaluate the long-term salinity variation trends [18,25,65,66]. However, it is unclear whether the large-scale pattern of salinity change associated with climate change can be well represented in the new reconstruction.
In this study, we used the global linear trend in mean salinity between 1 m and 2000 m (S2000) from 1993 to 2018 to evaluate this. The reconstructed data show an overall linear trend consistent with that of the truth values. The differences between the linear trend of the truth value and that of each reconstructed experiment are presented in graphical form to better represent the differences in the trend (Figure 17a–h). The reconstructed data yielded by Case LGL were closest to the truth values in linear trend when performed with the original synthetic dataset, whereas those yielded by Case NN exhibited a weaker linear trend compared with the truth values in most ocean areas. The linear trend of Case LGL was further weakened when performed with noised data, and in the Southern Ocean from 0° to 180° longitude, the linear trend of Case NN was significantly stronger than that of the truth value.
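The trend comparison described above amounts to fitting a straight line to each S2000 time series and differencing the slopes. The following is a minimal sketch under stated assumptions: ordinary least squares on annual values is assumed (the fitting method and time resolution are not given here), the function name `linear_trend` is hypothetical, the two series are invented for illustration, and the sign convention of the difference (reconstruction minus truth) is a choice made for this example.

```python
def linear_trend(years, values):
    """Ordinary least-squares slope of values against years (per year)."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    return sxy / sxx

# Hypothetical S2000 series (psu) at one grid point for 1993-2018.
years = list(range(1993, 2019))
truth_series = [34.70 + 0.0020 * (y - 1993) for y in years]
recon_series = [34.70 + 0.0015 * (y - 1993) for y in years]

truth_trend = linear_trend(years, truth_series)
recon_trend = linear_trend(years, recon_series)
trend_diff = recon_trend - truth_trend  # negative: reconstruction is weaker
print(f"truth {truth_trend:.4f}, recon {recon_trend:.4f}, diff {trend_diff:.4f} psu/yr")
```

Applying the same function at every grid point would give a map of trend differences comparable in form to the panels of Figure 17.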
Despite such differences in the trends, the reconstructed data exhibited variation characteristics consistent with those of the truth values. For example, the freshening trend in the high-latitude region of the North Atlantic and the salinification trend in other regions of this ocean basin were reproduced well. This is consistent with the finding that global warming has led to a decrease in sea surface salinity in the North Atlantic, which implies that the reconstructed data capture the climatic characteristics of long-term salinity variations well [65].

4. Conclusions

CNRM-CM6-1-HR model data were resampled using in situ observation data, and equivalent coarse-resolution IAP1° salinity data were established by wavenumber spectrum analysis as the background field for data reconstruction. Furthermore, the truth value was artificially perturbed, which created a ‘synthetic dataset’ from the model data. On the basis of this ‘synthetic dataset’, four experiments were designed to reconstruct subsurface salinity data by adopting two different ML approaches (FFNN and LightGBM) and two different vertical layering schemes (divided into 41 layers and non-layered). The advantages and disadvantages of the different reconstruction schemes were compared, and the applicability of the FFNN and LightGBM approaches in reconstructing ocean salinity data with global coverage from sparse data was examined. The major findings are summarised as follows.
(1) When different experiments were performed with the original synthetic dataset, Case LGL had the smallest average RMSE over 1–2000 m (0.032 psu), followed by Case NN, Case LG, and Case NNL. The CCs of the experiments showed similar characteristics to the RMSEs. The performance of Case NN in terms of RMSE and CC was not as good as that of Case LGL; however, it was considerably better than those of Case NNL and Case LG. Case LGL divided the global ocean into 41 layers, with one model for each layer, and the reconstructed fields showed discontinuities in the vertical structure. This can be attributed to the inadequate treatment of the physical relationship between adjacent layers during the ML-based layered modelling, which results in inadequate consideration of the correlations and continuous variation characteristics between adjacent layers in the reconstruction experiments for individual layers. Therefore, although Case LGL showed the best reconstruction performance, as measured by RMSE and CC, it exhibited poorer overall performance compared with Case NN because of the inability to accurately reflect the vertical distribution characteristics of salinity. Case NN was inferred to be suitable for reconstructing all depth layers of the global ocean.
(2) Compared with the reconstruction performance with the original synthetic dataset, all four experiments exhibited degraded reconstruction performance with the noised synthetic dataset. The LightGBM-based Case LGL exhibited the greatest degradation rates in both statistical metrics, with the RMSE and CC degrading by 15.5% and 4.3%, respectively. However, Case LGL still showed a better performance compared with the three other experiments because it showed a sufficiently good performance when used with the original synthetic dataset. Furthermore, despite the larger noise-induced degradations, it outperformed its counterparts when performed with the noised synthetic dataset. With larger noise added to the truth values (for example, with the truth values mixed with 5 × IAP1_std × ε noise), the greater vulnerability of Case LGL to noise-induced performance degradation eliminates its superiority over Case NN. This suggests that the LightGBM method is less suitable than the NN method for large-noise, sparse datasets [53]. Case NN showed relatively better stability in reconstruction performance when performed with a noised synthetic dataset, and better generalizability; therefore, it was found to be more suitable for observation datasets with large noise.
(3) The CCs of the reconstructed subsurface salinity data increased significantly after 2005 because of the increased in situ observations from the Argo project. The CC of Case NN increased from ~0.8 to 0.85. The statistics showed that the average sampling frequency of salinity observations within the 0.25° grid was less than 10 months from 1993 to 2018 (with a sampling rate below 5%). The inadequate sampling of global ocean observations is essentially a ‘wall’ that prevents further improvement in reconstruction skills [67,68]. In other words, the reconstruction skill of the current optimal methods cannot be significantly improved because of the prohibitive data sparsity. Data sparsity remains a major challenge in accurately reconstructing global subsurface salinity.
In summary, Case NN achieved the best results in the reconstruction of the global sea subsurface salinity data from synthetic data; the reconstructed data reflected well the spatial modality and vertical distribution characteristics of salinity. It further exhibited a low RMSE and a high CC in the evaluation. Moreover, Case NN showed stable reconstruction performance when the noised synthetic dataset was employed, which indicates good generalizability. Therefore, the data reconstructed by Case NN not only maintain large-scale modality but also show realistic spatial signals in the Gulf Stream, Kuroshio, and ACC regions, with stronger mesoscale information than the coarse-resolution 1° × 1° product, because the high-resolution remote-sensing sea surface observations are merged onto the coarse-resolution background field. The results show that Case NN can effectively transfer small-scale spatial variations in the sea surface remote-sensing fields into the 0.25° × 0.25° salinity field. Our results provide methodological references for the reconstruction of global-coverage, high-resolution subsurface salinity data based on real-world observations.

Author Contributions

Conceptualization, T.T. and H.L.; data curation, G.W.; formal analysis, T.T. and G.W.; funding acquisition, H.L. and J.Z.; investigation, H.L. and Y.A.; methodology, T.T. and J.Z.; project administration, J.Z.; resources, H.L. and J.Z.; software, T.T. and G.W.; supervision, H.L. and J.S.; validation, G.L. and Y.A.; visualization, G.W. and G.L.; writing—original draft, T.T. and J.S.; writing—review & editing, T.T. and H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Strategic Priority Research Program of the Chinese Academy of Sciences, grant number XDB42040402 and the National Natural Science Foundation of China, grant numbers 42122046, 42076202 and 41605070.

Data Availability Statement

Not applicable.

Acknowledgments

Many thanks are extended to the institutions and organizations that provided the data used in this paper: the IAP1° dataset (http://www.ocean.iap.ac.cn/, accessed on 20 September 2021); the in situ ocean salinity observations, obtained from the World Ocean Database (https://www.ncei.noaa.gov/products/world-ocean-database, accessed on 10 October 2021); and the CMIP6 data (https://esgf-node.llnl.gov/search/cmip6/, accessed on 5 September 2021). We also thank all anonymous reviewers for their detailed and constructive comments.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lyman, J.M.; Johnson, G.C. Estimating Global Ocean Heat Content Changes in the Upper 1800 m since 1950 and the Influence of Climatology Choice. J. Clim. 2014, 27, 1945–1957.
  2. Durack, P.J.; Gleckler, P.J.; Landerer, F.W.; Taylor, K.E. Quantifying Underestimates of Long-Term Upper-Ocean Warming. Nat. Clim. Chang. 2014, 4, 999–1005.
  3. Ciais, P.; Sabine, C.; Bala, G.; Bopp, L.; Brovkin, V.; Canadell, J.; Chhabra, A.; DeFries, R.; Galloway, J.; Heimann, M.; et al. Climate Change 2013: The Physical Science Basis; Contribution of Working Group I to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change; Cambridge University Press: Cambridge, UK, 2014.
  4. Domingues, C.M.; Church, J.A.; White, N.J.; Gleckler, P.J.; Wijffels, S.E.; Barker, P.M.; Dunn, J.R. Improved Estimates of Upper-Ocean Warming and Multi-Decadal Sea-Level Rise. Nature 2008, 453, 1090–1093.
  5. Bagnell, A.; DeVries, T. 20th Century Cooling of the Deep Ocean Contributed to Delayed Acceleration of Earth’s Energy Imbalance. Nat. Commun. 2021, 12, 4604.
  6. Li, G.; Zhang, Y.; Xiao, J.; Song, X.; Abraham, J.; Cheng, L.; Zhu, J. Examining the Salinity Change in the Upper Pacific Ocean during the Argo Period. Clim. Dyn. 2019, 53, 6055–6074.
  7. Durack, P.J. Ocean Salinity and the Global Water Cycle. Oceanography 2015, 28, 20–31.
  8. Guinehut, S.; Dhomps, A.L.; Larnicol, G.; Le Traon, P.Y. High Resolution 3-D Temperature and Salinity Fields Derived from in Situ and Satellite Observations. Ocean Sci. 2012, 8, 845–857.
  9. Johnson, G.C.; Hosoda, S.; Jayne, S.R.; Oke, P.R.; Riser, S.C.; Roemmich, D.; Suga, T.; Thierry, V.; Wijffels, S.E.; Xu, J. Argo-Two Decades: Global Oceanography, Revolutionized. Ann. Rev. Mar. Sci. 2022, 14, 379–403.
  10. Cheng, L.; Zhu, J. Benefits of CMIP5 Multimodel Ensemble in Reconstructing Historical Ocean Subsurface Temperature Variations. J. Clim. 2016, 29, 5393–5416.
  11. Ishii, M.; Kimoto, M.; Kachi, M. Historical Ocean Subsurfaces Temperature Analysis with Error Estimates. Mon. Weather Rev. 2003, 131, 51–73.
  12. Levitus, S.; Antonov, J.; Boyer, T. Warming of the World Ocean, 1955–2003. Geophys. Res. Lett. 2005, 32.
  13. Lyman, J.M.; Johnson, G.C. Estimating Annual Global Upper-Ocean Heat Content Anomalies despite Irregular In Situ Ocean Sampling. J. Clim. 2008, 21, 5629–5641.
  14. Willis, J.K.; Roemmich, D.; Cornuelle, B. Interannual Variability in Upper Ocean Heat Content, Temperature, and Thermosteric Expansion on Global Scales. J. Geophys. Res. Ocean. 2004, 109.
  15. Good, S.A.; Martin, M.J.; Rayner, N.A. EN4: Quality Controlled Ocean Temperature and Salinity Profiles and Monthly Objective Analyses with Uncertainty Estimates. J. Geophys. Res. Ocean. 2013, 118, 6704–6716.
  16. Levitus, S.; Antonov, J.I.; Boyer, T.P.; Baranova, O.K.; Garcia, H.E.; Locarnini, R.A.; Mishonov, A.V.; Reagan, J.R.; Seidov, D.; Yarosh, E.S.; et al. World Ocean Heat Content and Thermosteric Sea Level Change (0–2000 m), 1955–2010. Geophys. Res. Lett. 2012, 39.
  17. Ishii, M.; Kimoto, M. Reevaluation of Historical Ocean Heat Content Variations with Time-Varying XBT and MBT Depth Bias Corrections. J. Oceanogr. 2009, 65, 287–299.
  18. Cheng, L.; Trenberth, K.E.; Fasullo, J.; Boyer, T.; Abraham, J.; Zhu, J. Improved Estimates of Ocean Heat Content from 1960 to 2015. Sci. Adv. 2017, 3, e1601545.
  19. Jiang, M.; Zhu, Z. The Role of Artificial Intelligence Algorithms in Marine Scientific Research. Front. Mar. Sci. 2022, 9, 781.
  20. Lou, R.; Lv, Z.; Dang, S.; Su, T.; Li, X. Application of Machine Learning in Ocean Data. Multimed. Syst. 2021.
  21. Radin, C.; Nieves, V. Machine-Learning Based Reconstructions of Past Regional Sea Level Variability from Proxy Data. Geophys. Res. Lett. 2021, 48, e2021GL095382.
  22. Gan, M.; Pan, S.; Chen, Y.; Cheng, C.; Pan, H.; Zhu, X. Application of the Machine Learning LightGBM Model to the Prediction of the Water Levels of the Lower Columbia River. J. Mar. Sci. Eng. 2021, 9, 496.
  23. Foster, D.; Gagne II, D.J.; Whitt, D.B. Probabilistic Machine Learning Estimation of Ocean Mixed Layer Depth from Dense Satellite and Sparse In Situ Observations. J. Adv. Model. Earth Syst. 2021, 13, e2021MS002474.
  24. Stamell, J.; Rustagi, R.; Gloege, L.; McKinley, G. Strengths and weaknesses of three Machine Learning methods for pCO2 interpolation. Geosci. Model Dev. Discuss. 2020, 2020, 1–25.
  25. Hu, S.; Liu, H.; Zhao, W.; Shi, T.; Hu, Z.; Li, Q.; Wu, G. Comparison of Machine Learning Techniques in Inferring Phytoplankton Size Classes. Remote Sens. 2018, 10, 191.
  26. Hu, S.; Zhou, W.; Wang, G.; Cao, W.; Xu, Z.; Liu, H.; Wu, G.; Zhao, W. Comparison of Satellite-Derived Phytoplankton Size Classes Using In-Situ Measurements in the South China Sea. Remote Sens. 2018, 10, 526.
  27. Chen, S.; Hu, C.; Barnes, B.B.; Wanninkhof, R.; Cai, W.-J.; Barbero, L.; Pierrot, D. A Machine Learning Approach to Estimate Surface Ocean PCO2 from Satellite Measurements. Remote Sens. Environ. 2019, 228, 203–226.
  28. Wang, H.; Song, T.; Zhu, S.; Yang, S.; Feng, L. Subsurface Temperature Estimation from Sea Surface Data Using Neural Network Models in the Western Pacific Ocean. Mathematics 2021, 9, 852.
  29. Lu, W.; Su, H.; Yang, X.; Yan, X.H. Subsurface Temperature Estimation from Remote Sensing Data Using a Clustering-Neural Network Method. Remote Sens. Environ. 2019, 229, 213–222.
  30. Voldoire, A. CNRM-CERFACS CNRM-CM6-1-HR Model Output Prepared for CMIP6 ScenarioMIP. Earth Syst. Grid Fed. 2019.
  31. Boyer, T.P.; Baranova, O.K.; Coleman, C.; Garcia, H.E.; Grodsky, A.; Locarnini, R.A.; Mishonov, A.V.; Paver, C.R.; Reagan, J.R.; Seidov, D.; et al. World Ocean Database 2018. Mishonov, A.V., Technical Editor, NOAA Atlas NESDIS 87. 2018; pp. 1–207. Available online: https://www.ncei.noaa.gov/sites/default/files/2020-04/wod_intro_0.pdf (accessed on 15 May 2022).
  32. Gouretski, V.; Cheng, L. Correction for Systematic Errors in the Global Dataset of Temperature Profiles from Mechanical Bathythermographs. J. Atmos. Ocean. Technol. 2020, 37, 841–855.
  33. Abraham, J.P.; Cowley, R.; Cheng, L. Quantification of the Effect of Water Temperature on the Fall Rate of Expendable Bathythermographs. J. Atmos. Ocean. Technol. 2016, 33, 1271–1284.
  34. Klemas, V.; Yan, X.H. Subsurface and Deeper Ocean Remote Sensing from Satellites: An Overview and New Results. Prog. Oceanogr. 2014, 122, 1–9.
  35. Cheng, L.; Trenberth, K.E.; Gruber, N.; Abraham, J.P.; Fasullo, J.T.; Li, G.; Mann, M.E.; Zhao, X.; Zhu, J. Improved Estimates of Changes in Upper Ocean Salinity and the Hydrological Cycle. J. Clim. 2020, 33, 10357–10381.
  36. Abdar, M.; Pourpanah, F.; Hussain, S.; Rezazadegan, D.; Liu, L.; Ghavamzadeh, M.; Fieguth, P.; Cao, X.; Khosravi, A.; Acharya, U.R.; et al. A Review of Uncertainty Quantification in Deep Learning: Techniques, Applications and Challenges. Inf. Fusion 2021, 76, 243–297.
  37. Contractor, S.; Roughan, M. Efficacy of Feedforward and LSTM Neural Networks at Predicting and Gap Filling Coastal Ocean Timeseries: Oxygen, Nutrients, and Temperature. Front. Mar. Sci. 2021, 8, 368.
  38. Gabella, M. Topology of Learning in Feedforward Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 3588–3592.
  39. LeCun, Y.A.; Bottou, L.; Orr, G.B.; Müller, K.R. Efficient Backprop. In Lecture Notes in Computer Science (including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Montavon, G., Orr, G.B., Müller, K.-R., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; Volume 7700, ISBN 9783642352881.
  40. Moussa, H.; Benallal, M.A.; Goyet, C.; Lefèvre, N. Satellite-Derived CO2 Fugacity in Surface Seawater of the Tropical Atlantic Ocean Using a Feedforward Neural Network. Int. J. Remote Sens. 2016, 37, 580–598.
  41. Vikas Gupta. Understanding Feedforward Neural Networks. Learn Open CV 2017, 1. Available online: https://learnopencv.com/understanding-feedforward-neural-networks/ (accessed on 15 June 2022).
  42. Keras: Deep Learning Library for Theano and TensorFlow. Available online: https://faroit.com/keras-docs/1.1.1/ (accessed on 20 June 2022).
  43. Liashchynskyi, P.; Liashchynskyi, P. Grid Search, Random Search, Genetic Algorithm: A Big Comparison for NAS. arXiv 2019, arXiv:1912.06059.
  44. Dan Foresee, F.; Hagan, M.T. Gauss-Newton Approximation to Bayesian Learning. In Proceedings of the IEEE International Conference on Neural Networks - Conference Proceedings, Houston, TX, USA, 12 June 1997; Volume 3, pp. 1930–1935.
  45. Su, H.; Lu, X.; Chen, Z.; Zhang, H.; Lu, W.; Wu, W. Estimating Coastal Chlorophyll-A Concentration from Time-Series OLCI Data Based on Machine Learning. Remote Sens. 2021, 13, 576.
  46. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U., Von Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
  47. He, Z.; Wang, X.; Wu, X.; Chen, Z.; Chen, J. Projecting Three-Dimensional Ocean Thermohaline Structure in the North Indian Ocean from the Satellite Sea Surface Data Based on a Variational Method. J. Geophys. Res. Ocean. 2021, 126, e2020JC016759.
  48. Su, H.; Zhang, T.; Lin, M.; Lu, W.; Yan, X.-H. Predicting Subsurface Thermohaline Structure from Remote Sensing Data Based on Long Short-Term Memory Neural Networks. Remote Sens. Environ. 2021, 260, 112465.
  49. Meng, L.; Yan, C.; Zhuang, W.; Zhang, W.; Yan, X.H. Reconstruction of Three-Dimensional Temperature and Salinity Fields From Satellite Observations. J. Geophys. Res. Ocean. 2021, 126, e2021JC017605.
  50. Denvil-Sommer, A.; Gehlen, M.; Vrac, M.; Mejia, C. LSCE-FFNN-v1: A Two-Step Neural Network Model for the Reconstruction of Surface Ocean PCO2 over the Global Ocean. Geosci. Model Dev. 2019, 12, 2091–2105.
  51. Su, H.; Wang, A.; Zhang, T.; Qin, T.; Du, X.; Yan, X.H. Super-Resolution of Subsurface Temperature Field from Remote Sensing Observations Based on Machine Learning. Int. J. Appl. Earth Obs. Geoinf. 2021, 102, 102440.
  52. Su, H.; Zhang, H.; Geng, X.; Qin, T.; Lu, W.; Yan, X.H. OPEN: A New Estimation of Global Ocean Heat Content for Upper 2000 Meters from Remote Sensing Data. Remote Sens. 2020, 12, 2294.
  53. Fassnacht, F.E.; Hartig, F.; Latifi, H.; Berger, C.; Hernández, J.; Corvalán, P.; Koch, B. Importance of Sample Size, Data Type and Prediction Method for Remote Sensing-Based Estimations of Aboveground Forest Biomass. Remote Sens. Environ. 2014, 154, 102–114. [Google Scholar] [CrossRef]
  54. Chen, G.; Peng, L.; Ma, C. Climatology and Seasonality of Upper Ocean Salinity: A Three-Dimensional View from Argo Floats. Clim. Dyn. 2018, 50, 2169–2182. [Google Scholar] [CrossRef] [Green Version]
  55. Yan, H.; Wang, H.; Zhang, R.; Bao, S.; Chen, J.; Wang, G. The Inconsistent Pairs Between In Situ Observations of Near Surface Salinity and Multiple Remotely Sensed Salinity Data. Earth Space Sci. 2021, 8, e2020EA001355. [Google Scholar] [CrossRef]
  56. Wong, A.P.S.; Wijffels, S.E.; Riser, S.C.; Pouliquen, S.; Hosoda, S.; Roemmich, D.; Gilson, J.; Johnson, G.C.; Martini, K.; Murphy, D.J.; et al. Argo Data 1999–2019: Two Million Temperature-Salinity Profiles and Subsurface Velocity Observations from a Global Array of Profiling Floats. Front. Mar. Sci. 2020, 7, 700. [Google Scholar] [CrossRef]
  57. Roemmich, D.; Alford, M.H.; Claustre, H.; Johnson, K.; King, B.; Moum, J.; Oke, P.; Owens, W.B.; Pouliquen, S.; Purkey, S.; et al. On the Future of Argo: A Global, Full-Depth, Multi-Disciplinary Array. Front. Mar. Sci. 2019, 6. [Google Scholar] [CrossRef]
  58. Frenger, I.; Münnich, M.; Gruber, N.; Knutti, R. Southern Ocean Eddy Phenomenology. J. Geophys. Res. Ocean. 2015, 120, 7413–7449. [Google Scholar] [CrossRef] [Green Version]
  59. Chassignet, E.P.; Fox-Kemper, B.; Yeager, S.G.; Bozec, A. Sources and Sinks of Ocean Mesoscale Eddy Energy. CLIVAR Exch. CLIVAR Var. 2020, 18, 3–8. [Google Scholar]
  60. Rhines, P.B. Mesoscale Eddies. In Encyclopedia of Ocean Sciences, 3rd ed.; Cochran, J.K., Bokuniewicz, H.J., Yager, P.L., Eds.; Academic Press: Oxford, UK, 2019; pp. 115–127. ISBN 9780128130810. [Google Scholar]
  61. Skliris, N.; Marsh, R.; Josey, S.A.; Good, S.A.; Liu, C.; Allan, R.P. Salinity Changes in the World Ocean since 1950 in Relation to Changing Surface Freshwater Fluxes. Clim. Dyn. 2014, 43, 709–736. [Google Scholar] [CrossRef]
  62. Llovel, W.; Lee, T. Importance and Origin of Halosteric Contribution to Sea Level Change in the Southeast Indian Ocean during 2005–2013. Geophys. Res. Lett. 2015, 42, 1148–1157. [Google Scholar] [CrossRef]
  63. Adler, R.F.; Sapiano, M.R.P.; Huffman, G.J.; Wang, J.J.; Gu, G.; Bolvin, D.; Chiu, L.; Schneider, U.; Becker, A.; Nelkin, E.; et al. The Global Precipitation Climatology Project (GPCP) Monthly Analysis (New Version 2.3) and a Review of 2017 Global Precipitation. Atmosphere 2018, 9, 138. [Google Scholar] [CrossRef] [Green Version]
  64. Liu, Y.; Cheng, L.; Pan, Y.; Abraham, J.; Zhang, B.; Zhu, J.; Song, J. Climatological Seasonal Variation of the Upper Ocean Salinity. Int. J. Climatol. 2022, 42, 3477–3498. [Google Scholar] [CrossRef]
  65. Durack, P.J.; Wijffels, S.E. Fifty-Year Trends in Global Ocean Salinities and Their Relationship to Broad-Scale Warming. J. Clim. 2010, 23, 4342–4362. [Google Scholar] [CrossRef]
  66. Boyer, T.P.; Levitus, S.; Antonov, J.I.; Locarnini, R.A.; Garcia, H.E. Linear Trends in Salinity for the World Ocean, 1955–1998. Geophys. Res. Lett. 2005, 32. [Google Scholar] [CrossRef] [Green Version]
  67. Gregor, L.; Lebehot, A.D.; Kok, S.; Scheel Monteiro, P.M. A comparative assessment of the uncertainties of global surface ocean CO2 estimates using a machine-learning ensemble (CSIR-ML6 version 2019a)-Have we hit the wall? Geosci. Model Dev. 2019, 12, 5113–5136. [Google Scholar] [CrossRef] [Green Version]
  68. Gloege, L.; McKinley, G.A.; Landschützer, P.; Fay, A.R.; Frölicher, T.L.; Fyfe, J.C.; Ilyina, T.; Jones, S.; Lovenduski, N.S.; Rodgers, K.B.; et al. Quantifying Errors in Observationally Based Estimates of Ocean Carbon Sink Variability. Glob. Biogeochem. Cycles 2021, 35, e2020GB006788. [Google Scholar] [CrossRef]
Figure 1. Examples of the resampled data at 100 m depth for January 2016. (a) Salinity anomaly field of the CNRM-CM6-1-HR model. (b) In situ observations of salinity. (c) Salinity anomaly field resampled according to the location of in situ observations. (d) Noised resampled salinity anomaly field.
Figure 2. Wavenumber spectrum distributions in the Kuroshio and its extension region at 5 m depth.
Figure 3. Spatial distribution of reconstructed data and density scatter plots of reconstructed vs. truth values for the four experiments at 10 m depth in January 2016. (a–d) and (i–l) Reconstruction based on the original synthetic dataset. (e–h) and (m–p) Reconstruction based on the noised synthetic dataset. (q) Truth value of CNRM-CM6-1-HR. The color-coded dots represent the number of samples.
Figure 4. Spatial distribution of reconstructed data and density scatter plots of reconstructed vs. truth values for the four experiments at 100 m depth in January 2016. (a–d) and (i–l) Reconstruction based on the original synthetic dataset. (e–h) and (m–p) Reconstruction based on the noised synthetic dataset. (q) Truth value of CNRM-CM6-1-HR. The color-coded dots represent the number of samples.
Figure 5. Spatial distribution of reconstructed data and density scatter plots of reconstructed vs. truth values for the four experiments at 800 m depth in January 2016. (a–d) and (i–l) Reconstruction based on the original synthetic dataset. (e–h) and (m–p) Reconstruction based on the noised synthetic dataset. (q) Truth value of CNRM-CM6-1-HR. The color-coded dots represent the number of samples.
Figure 6. (a) Distribution of synthetic remote sensing: SSTA (left y axis), ADTA (right y axis), and SSSA (right y axis) at the sea surface. The vertical section along 35.375°N in the Gulf Stream region of (b) resampled salinity anomalies according to in situ observation, (c) equivalent IAP1° subsurface salinity anomalies, (d) truth value, (e–h) reconstructed data from experiments performed with the original synthetic dataset, and (i–l) reconstructed data from experiments performed with the noised synthetic dataset. The fields for January 2016 are shown.
Figure 7. (a) Distribution of synthetic remote sensing: SSTA (left y axis), ADTA (right y axis), and SSSA (right y axis) at the sea surface. The vertical section along 35.375°N in the Kuroshio and its extension region of (b) resampled salinity anomalies according to in situ observation, (c) equivalent IAP1° subsurface salinity anomalies, (d) truth value, (e–h) reconstructed data from experiments performed with the original synthetic dataset, and (i–l) reconstructed data from experiments performed with the noised synthetic dataset. The fields for January 2016 are shown.
Figure 8. Vertical section along 35.375°N in the Gulf Stream region of the reconstructed data and truth values of the CNRM-CM6-1-HR model. The fields for January 1998 are shown. (a–d) Experiments performed with the original synthetic dataset. (f–i) Experiments performed with the noised synthetic dataset. (e) Truth value of CNRM-CM6-1-HR.
Figure 9. (a) Vertical distributions of RMSEs between the reconstructed data and the truth values for the four experiments with the original synthetic dataset. (b) Vertical distribution of RMSEs between the reconstructed data and the truth values for the four experiments with the noised synthetic dataset. (c–f) Vertical distribution of RMSEs of the individual experiments with the original versus the noised synthetic datasets. (g) Bar graphs of average RMSEs in the 1–2000 m range of the four experiments. (h) Noise-induced RMSE degradation rates of the four experiments.
Figure 10. (a) Vertical distributions of CCs between the reconstructed data and the truth value for the four experiments with the original synthetic dataset. (b) Vertical distribution of CCs between the reconstructed data and the truth value for the four experiments with the noised synthetic dataset. (c–f) Vertical distribution of CCs of the individual experiments on the original versus the noised synthetic datasets. (g) Bar graphs of average CCs in the 1–2000 m range of the four experiments. (h) Noise-induced CC degradation rates of the four experiments.
Figure 11. Average RMSEs and CCs in the 1–2000 m depth range and the RMSEs and CCs at three typical layers of the four experiments: (a) averaged over 1–2000 m; and at depths of (b) 10 m; (c) 100 m; (d) 800 m. The scatter points in the diagrams represent different schemes, the radial lines represent CCs, and the horizontal and vertical axes represent RMSEs.
Figure 12. (a,c) Bar graphs of average RMSEs and CCs in the 1–2000 m depth range of the four reconstruction experiments performed with the original synthetic dataset and the large-noise synthetic dataset ($5 \times \mathrm{IAP1\_std} \times \varepsilon$ noise added to the true value). (b,d) Noise-induced RMSE and CC degradations.
Figure 13. Time series of global average RMSEs in the 1–2000 m depth range and at the three layers. (a–d) Experiments performed with the original synthetic dataset. (e–h) Experiments performed with the noised synthetic dataset.
Figure 14. Time series of global average CCs in the 1–2000 m depth range and at the three layers of the four experiments. (a–d) Experiments performed with the original synthetic dataset. (e–h) Experiments performed with the noised synthetic dataset.
Figure 15. Spatial distributions of the average RMSEs in the 1–2000 m depth range of the four experiments. (a–d) Experiments performed with the original synthetic dataset. (e–h) Experiments performed with the noised synthetic dataset.
Figure 16. Spatial distributions of the average CCs in the 1–2000 m depth range of the four experiments. (a–d) Experiments performed with the original synthetic dataset. (e–h) Experiments performed with the noised synthetic dataset.
Figure 17. (a–h) Difference between the linear trends of the reconstructed experiment and truth values in the mean salinity of S2000 from 1993 to 2018 over the globe. (i) Linear trend of salinity truth value of CNRM-CM6-1-HR model.
Table 1. Experimental design.
| Experiment | Input | Truth Value | Vertical Layering Scheme | ML Approach |
| --- | --- | --- | --- | --- |
| Case NN | time, longitude, latitude, depth, equ-IAP1SA, ADTA, SSTA, USSWA, VSSWA | Original/Noised | Non-layered, one model | FFNN |
| Case LG | Same as Case NN | Original/Noised | Non-layered, one model | LightGBM |
| Case NNL | Same as Case NN | Original/Noised | Divided into 41 layers, with one model for each layer | FFNN |
| Case LGL | Same as Case NN | Original/Noised | Divided into 41 layers, with one model for each layer | LightGBM |
Table 2. RMSE, CC, and noise-induced RMSE and CC degradation rates for the four experiments. Red bold and black bold represent the best and worst performers for that metric, respectively.
| Experiment | Original Synthetic Dataset: RMSE (psu) | Original Synthetic Dataset: CC | Noised Synthetic Dataset: RMSE (psu) | Noised Synthetic Dataset: CC | Degradation Rate (%): RMSE | Degradation Rate (%): CC |
| --- | --- | --- | --- | --- | --- | --- |
| Case NN | 0.035 | 0.866 | 0.039 | 0.835 | 12.0 | 3.5 |
| Case LG | 0.036 | 0.784 | 0.041 | 0.734 | 12.2 | 6.4 |
| Case NNL | 0.039 | 0.861 | 0.042 | 0.843 | 6.7 | 2.1 |
| Case LGL | 0.032 | 0.919 | 0.037 | 0.880 | 15.5 | 4.3 |
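The metrics in Table 2 can be illustrated with a minimal Python sketch. It assumes that RMSE and CC are the usual root-mean-square error and Pearson correlation coefficient, and that the degradation rate is the relative change of a metric from the original to the noised experiment, expressed in percent. That definition is an assumption made here for illustration, and the function names are hypothetical. Because the table lists rounded values, rates recomputed from them only approximate the published ones.

```python
import math


def rmse(reconstructed, truth):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(r - t) ** 2 for r, t in zip(reconstructed, truth)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_cc(reconstructed, truth):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(truth)
    mean_r = sum(reconstructed) / n
    mean_t = sum(truth) / n
    cov = sum((r - mean_r) * (t - mean_t) for r, t in zip(reconstructed, truth))
    var_r = sum((r - mean_r) ** 2 for r in reconstructed)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    return cov / math.sqrt(var_r * var_t)


def degradation_rate(original, noised, higher_is_better):
    """Relative worsening in percent (assumed definition, not taken from the paper).

    For RMSE (lower is better) the worsening is an increase;
    for CC (higher is better) it is a decrease.
    """
    loss = (original - noised) if higher_is_better else (noised - original)
    return 100.0 * loss / original


# Rounded values from Table 2: (RMSE original, CC original, RMSE noised, CC noised)
TABLE2 = {
    "Case NN": (0.035, 0.866, 0.039, 0.835),
    "Case LG": (0.036, 0.784, 0.041, 0.734),
    "Case NNL": (0.039, 0.861, 0.042, 0.843),
    "Case LGL": (0.032, 0.919, 0.037, 0.880),
}

for case, (rmse_o, cc_o, rmse_n, cc_n) in TABLE2.items():
    d_rmse = degradation_rate(rmse_o, rmse_n, higher_is_better=False)
    d_cc = degradation_rate(cc_o, cc_n, higher_is_better=True)
    print(f"{case}: RMSE degradation {d_rmse:.1f}%, CC degradation {d_cc:.1f}%")
```

With this definition, the CC rates recomputed from the rounded table entries agree with the published column to within about 0.1 percentage points, while the RMSE rates differ by up to roughly two percentage points, which is consistent with the published rates having been derived from unrounded RMSEs.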
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Tian, T.; Leng, H.; Wang, G.; Li, G.; Song, J.; Zhu, J.; An, Y. Comparison of Machine Learning Approaches for Reconstructing Sea Subsurface Salinity Using Synthetic Data. Remote Sens. 2022, 14, 5650. https://doi.org/10.3390/rs14225650
