1. Introduction
Dissolved oxygen (DO) serves as a vital biogeochemical indicator for assessing nearshore ecosystem health, exerting profound impacts on fisheries productivity and aquaculture sustainability [1]. Decades of empirical studies have revealed that the dynamic coupling mechanisms between DO concentrations and key environmental parameters, including water temperature (Tem), salinity (Sal), pH, and chlorophyll-a (Chl-a), manifest pronounced spatiotemporal heterogeneity across coastal systems [2,3]. The expanding implementation of marine ranching operations has introduced additional complexity to these environmental interactions through intensified anthropogenic-natural forcing [4,5]. As a new aquaculture model based on ecological engineering principles, marine ranching relies on the intelligent perception of environmental parameters and on ecosystem regulation; it not only revolutionizes the traditional aquaculture paradigm [6] but also builds a synergy mechanism of ecological, economic, and social benefits [7]. However, in situ monitoring of its ecological elements still faces three technical bottlenecks: biofouling and equipment corrosion caused by prolonged deployment lead to an average annual sensor failure rate of 22.5% (95% confidence interval: 15–30%) [8,9]; dynamic disturbances caused by ocean turbulence and extreme events increase the abnormal rate of monitoring data by three to five times [10]; and a standardized processing framework for multi-source heterogeneous data fusion is still lacking. Therefore, it is necessary to predict the trend of DO changes accurately from early observation data, to understand how marine ecological parameters change, and to establish a dynamic early warning system for these parameters that provides decision support for marine ranching management.
Nevertheless, contemporary DO prediction research confronts two critical methodological constraints. First, the lack of a standardized framework for parameter selection leads to significant differences in the variable combinations used in different studies, limiting the model’s generalization ability. Second, parameter optimization mechanisms are disconnected from ecological driving logic: most models rely on fixed parameter combinations and ignore the dynamic interactions and spatiotemporal heterogeneity between parameters, constraining further improvements in prediction accuracy.
Although machine learning methods (such as random forest (RF) and support vector machines (SVMs)) have demonstrated high accuracy in DO prediction [11,12], the selection of input parameters (e.g., Tem, Chl-a, turbidity (Tur)) lacks a unified framework, limiting the model’s generalization capability [13]. For example, Li et al. (2023) [14] compared the DO prediction performance of seven parameter combinations in the Yangtze River Estuary and found that redundant variables (e.g., weakly correlated Tur) introduce noise, while omitting key parameters (e.g., photosynthetically active radiation) leads to model bias. Yang et al. (2024) [15] built a univariate recursive forecasting model that predicts future DO values from the historical DO time series alone. Because that model does not consider the impact of environmental factors and other water quality indicators on DO, its universality is limited to a certain extent.
Additionally, most models rely on fixed parameter weights [16], failing to capture the dynamic interactions between parameters (e.g., nonlinear coupling between Tem and Sal) [17]. To address model interpretability issues, explainable techniques such as SHAP (Shapley Additive exPlanations) and LIME (Local Interpretable Model-Agnostic Explanations) have been gradually introduced into ecological prediction [18,19]. For instance, Cui et al. (2024) [20] used extreme gradient boosting (XGBoost) to predict the amplitude of tropical cyclone (TC)-induced sea surface temperature cooling (SSTC) from 12 predictors, and then applied SHAP to identify the contribution of each predictor to that amplitude, which gave the method attribute-oriented explainability and helped interpret the ocean–atmosphere interaction. Shadkani et al. (2024) [21] used SHAP analysis to investigate the contribution of each predictor to DO prediction and found that temperature contributed most in the data from the Illinois River (ILL) and Des Plaines River (DP).
In addition, although complex models such as deep learning can improve prediction accuracy through automatic feature extraction, their black-box nature obscures the contribution of key ecological parameters, posing a dual challenge of interpretability (Rudin, 2019) and adaptability when applied across regions [22].
The relative contributions of key environmental factors, such as Tem, Sal, pH and Chl-a, to DO dynamics remain underexplored. DO dynamics are governed by the synergistic effects of Tem, Sal, pH, and Chl-a [23]. These interactions involve complex physical–chemical–biological coupling processes: Tem directly modulates oxygen solubility while simultaneously enhancing photosynthetic activity. As the Sal of water bodies increases, the saturation of DO will decrease accordingly. Chl-a concentration, representing phytoplankton biomass, drives diurnal oxygen fluctuations through daytime photosynthetic production and nighttime respiratory consumption (concurrently altering pH via CO2 release). Seasonal typhoons and rainfall events significantly influence Sal profiles and vertical mixing, thereby affecting DO distribution.
Future research needs to integrate multi-source data (e.g., remote sensing, in situ sensors) with adaptive parameter optimization algorithms (e.g., reinforcement learning-based dynamic feature selection) to enhance the spatiotemporal generalization capability of models [24]. Meanwhile, developing “gray-box” models (e.g., physics-informed neural networks) that balance accuracy and interpretability will be a breakthrough direction [25].
To address these issues, this study proposes a “parameter screening-model optimization-ecological interpretation” closed-loop workflow. Using the multi-source observational data of the Goji Island marine ranching, it systematically analyzes how the choice of input parameters regulates the DO prediction model. By integrating correlation analysis, principal component analysis (PCA), and SHAP value interpretation, a data-driven variable screening standard is established. The study reveals the path by which input parameter optimization affects model performance, providing theoretical support for shifting DO prediction models from “accuracy-oriented” to “accuracy-interpretability-adaptability synergistic optimization”. This method system not only enhances the reliability of the intelligent monitoring of DO in marine ranching but also provides a universal paradigm for ecological model optimization under the fusion of multi-source heterogeneous data.
2. Materials and Methods
2.1. Data Sources
The data were obtained from the monitoring platform deployed at the Goji Island Marine Ranching (30°42′ N 122°46′ E), Zhejiang Province. This marine ranching is located on the continental shelf of the East China Sea, characterized by a subtropical monsoon climate and abundant fishery resources. It primarily cultivates shellfish, algae, and fish species such as large yellow croaker and sea bass, serving as an important aquaculture base in the East China Sea. The monitoring platform carried sea-surface sensors (dissolved oxygen, conductivity-temperature-depth (CTD), pH, and chlorophyll-turbidity sensors integrated on the data collector; Institute of Oceanographic Instrumentation, Shandong, China) that continuously measured six key ecological factors, including Tem, DO, and Sal (Figure 1). Parameters were sampled hourly. During operation, field personnel regularly collected water samples for laboratory analysis, and the results were compared with the sensor data to verify their accuracy.
As a typical aquaculture area in the East China Sea, the multi-source observational data from the marine ranching on Goji Island can reflect the complex dynamics of the nearshore ecosystem, providing a robust data foundation for this study.
The data utilized in this article cover the period from 12:00 on 22 September 2022 to 11:00 on 24 October 2022, recorded at hourly intervals, totaling 767 records. Each record encompasses six principal ecological parameters: Tem, Sal, pH, DO, Chl-a, and Tur.
The data contained missing and erroneous values caused by environmental and communication factors, with a missing rate below 3%. Missing entries were imputed by cubic spline interpolation, chosen because it preserves the continuity of the parameters, and the completed series were then normalized [26]. This yielded a continuous set of observations for the subsequent analysis.
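The preprocessing described above can be sketched as follows. This is a minimal illustration in plain Python rather than the MATLAB workflow used in the study; it assumes a natural cubic spline (the boundary condition is not stated in the text) and min-max scaling as the normalization (also not specified). The function names and the sample readings are hypothetical.

```python
import bisect

def natural_cubic_spline(xs, ys):
    """Return a callable natural cubic spline through the points (xs, ys)."""
    n = len(xs)
    h = [xs[i + 1] - xs[i] for i in range(n - 1)]
    # Tridiagonal system for the second derivatives m[i]; the first and last
    # rows encode the natural boundary condition m[0] = m[n-1] = 0.
    sub, diag, sup, rhs = [0.0] * n, [1.0] * n, [0.0] * n, [0.0] * n
    for i in range(1, n - 1):
        sub[i] = h[i - 1]
        diag[i] = 2.0 * (h[i - 1] + h[i])
        sup[i] = h[i]
        rhs[i] = 6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
    # Thomas algorithm: forward elimination, then back substitution.
    for i in range(1, n):
        w = sub[i] / diag[i - 1]
        diag[i] -= w * sup[i - 1]
        rhs[i] -= w * rhs[i - 1]
    m = [0.0] * n
    m[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        m[i] = (rhs[i] - sup[i] * m[i + 1]) / diag[i]

    def evaluate(x):
        # Find the knot interval containing x, then apply the spline formula.
        i = max(0, min(n - 2, bisect.bisect_right(xs, x) - 1))
        left, right = xs[i + 1] - x, x - xs[i]
        return ((m[i] * left ** 3 + m[i + 1] * right ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * left
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * right)

    return evaluate

def fill_missing(series):
    """Impute None entries of an evenly sampled series from the spline."""
    known_x = [float(i) for i, v in enumerate(series) if v is not None]
    known_y = [v for v in series if v is not None]
    spline = natural_cubic_spline(known_x, known_y)
    return [v if v is not None else spline(float(i)) for i, v in enumerate(series)]

def min_max_normalize(values):
    """Scale values to [0, 1]; assumes the series is not constant."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical hourly Tem readings with one dropped sample.
tem = [24.0, 24.5, None, 25.5, 26.0]
filled = fill_missing(tem)
scaled = min_max_normalize(filled)
```

In practice the scaling bounds would normally be taken from the training portion only, so that no information from the test period leaks into the model inputs.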
2.2. Methodologies for Research
The methodological framework of this study is depicted in Figure 2. First, Pearson and Spearman correlation coefficients were computed to analyze the relationships among the ecological variables (MATLAB R2024a). Pearson’s coefficient assesses linear dependence, which suits variable pairs such as Tem and DO, while Spearman’s rank coefficient captures monotonic, possibly nonlinear relationships and is less sensitive to outliers. Using both accounts for linear and nonlinear patterns and gives a more robust picture of variable interactions. The correlation results were used to identify the parameters strongly associated with DO and to define the input groups for comparison: one that excludes the most weakly correlated parameter and, as a control, one that excludes a moderately correlated parameter (Section 2.3).
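To make the difference between the two coefficients concrete, the sketch below computes both in plain Python (the study used MATLAB R2024a; this is only an illustration). Spearman’s coefficient is obtained as the Pearson coefficient of the average ranks. The sample values are hypothetical and chosen to be monotonic but nonlinear.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical monotonic but nonlinear relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [v ** 3 for v in x]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
```

For this toy series the rank-based coefficient reaches its maximum because the ordering is perfectly preserved, while the Pearson coefficient stays below it because the relationship is not a straight line.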
PCA (MATLAB R2024a) was then applied to distill the most salient features from the high-dimensional data. By extracting principal components, PCA reduces redundant variables and noise, which enhances the model’s generalization capability.
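As a rough illustration of what this step computes, the following plain-Python sketch extracts principal components by power iteration with deflation on the covariance matrix of standardized variables. It is not the MATLAB routine used in the study; standardizing the inputs, the iteration count, the starting vector, and the sample values are all assumptions made for the example.

```python
import math

def standardize(column):
    """Centre a variable and scale it to unit sample variance."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]

def covariance_matrix(columns):
    """Sample covariance matrix of a list of equal-length variables."""
    n, p = len(columns[0]), len(columns)
    means = [sum(c) / n for c in columns]
    return [[sum((columns[i][k] - means[i]) * (columns[j][k] - means[j])
                 for k in range(n)) / (n - 1)
             for j in range(p)] for i in range(p)]

def leading_eigenpair(matrix, iterations=500):
    """Power iteration; assumes the start vector is not orthogonal to the
    dominant eigenvector and that the matrix is not all zeros."""
    p = len(matrix)
    vec = [float(i + 1) for i in range(p)]
    for _ in range(iterations):
        prod = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in prod))
        vec = [x / norm for x in prod]
    value = sum(vec[i] * matrix[i][j] * vec[j] for i in range(p) for j in range(p))
    return value, vec

def pca(columns, n_components):
    """Return (explained variance ratio, loading vector) per component."""
    cov = covariance_matrix([standardize(c) for c in columns])
    p = len(cov)
    total = sum(cov[i][i] for i in range(p))
    components = []
    for _ in range(n_components):
        value, vec = leading_eigenpair(cov)
        components.append((value / total, vec))
        # Deflation: remove the component just found before the next pass.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(p)]
               for i in range(p)]
    return components

# Hypothetical readings: Tem and Sal move together, Tur is noisier.
tem = [24.1, 24.6, 25.0, 25.4, 25.9, 26.3]
sal = [30.2, 30.9, 31.6, 32.1, 33.0, 33.5]
tur = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]
components = pca([tem, sal, tur], n_components=2)
ratios = [ratio for ratio, _ in components]
```

The explained variance ratios indicate how many components are needed to retain most of the information; a production analysis would normally rely on a tested eigen-decomposition routine rather than this hand-written iteration.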
Next, we split the data set into training and testing sets in an 8:2 ratio. The training set was then used to train three machine learning models (MATLAB R2024a): SVM, Multilayer Perceptron (MLP), and RF. The DO values predicted by each model were compared with the measured values.
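A small sketch of the 8:2 split is given below. The text does not say whether the split was random or chronological; the example assumes a chronological split, which is a common choice for time series because it keeps the test period strictly after the training period. The function name and the placeholder records are hypothetical.

```python
def chronological_split(records, train_fraction=0.8):
    """Split time-ordered records into an earlier training part and a later test part."""
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]

# Placeholder for the 767 hourly records described in Section 2.1.
records = list(range(767))
train, test = chronological_split(records)
```

With 767 records this rule assigns the first 613 to training and the remaining 154 to testing.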
The three models each possess distinct advantages and disadvantages. The RF model excels in handling multi-dimensional features, albeit with relatively high computational complexity. The MLP is well suited for large-scale datasets but is prone to overfitting. The SVM performs effectively with small sample sizes, yet its capability to process nonlinear data is limited.
This process significantly reduces the risk of overfitting by addressing two key aspects. First, the correlation analysis eliminated redundant or low-relevance parameters, ensuring that only the most meaningful features were retained. This reduced the model’s complexity and prevented it from capturing noise or irrelevant patterns in the data. Second, the PCA step further compressed the feature space by transforming the input variables into a set of uncorrelated principal components, which retain the most significant variance in the data. This dimensionality reduction not only enhances computational efficiency but also mitigates the risk of overfitting by focusing on the most informative components.
Finally, the SVM, MLP, and RF models were used to generate the DO predictions.
Table 1 presents a systematic comparative analysis of the three machine learning methods.
After the predictions were obtained, the performance of each model was evaluated.
This article employs three metrics to assess the model’s performance: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Coefficient of Determination (R²) (MATLAB R2024a) [27].
The RMSE quantifies the deviation between predicted and observed values. Because the errors are squared before averaging, it is highly sensitive to outliers: a single prediction that deviates substantially from the observed value yields a markedly larger RMSE.
The MAE averages the absolute errors, so positive and negative errors cannot cancel each other, and each error contributes in proportion to its magnitude. It therefore gives a direct measure of the typical prediction error and is less dominated by outliers than the RMSE.
The R² offers a standardized measure of predictive accuracy. As this metric approaches unity, it indicates an increasingly close correspondence between predicted and actual values, with a value of 1 representing perfect predictive accuracy. It expresses the proportion of variance in the dependent variable that is predictable from the independent variable(s).
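The three metrics can be written out as below, using their standard definitions; the formulas are not reproduced in the text above, so this sketch assumes the usual forms. The observed and predicted values are hypothetical and include one larger error to show how RMSE and MAE respond differently.

```python
import math

def rmse(observed, predicted):
    """Root mean square error: sqrt(mean((obs - pred)^2))."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def mae(observed, predicted):
    """Mean absolute error: mean(|obs - pred|)."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical DO values; the last prediction is deliberately further off.
observed = [6.0, 6.5, 7.0, 7.5]
predicted = [6.1, 6.4, 7.0, 7.9]
scores = {
    "RMSE": rmse(observed, predicted),
    "MAE": mae(observed, predicted),
    "R2": r_squared(observed, predicted),
}
```

Here the single larger error pulls the RMSE above the MAE, which is the outlier sensitivity described above.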
2.3. Input Parameter Selection Mechanism
The input parameter selection was based on comprehensive correlation analysis using both Pearson and Spearman correlation coefficients to evaluate the relationships between potential predictors and DO concentrations. Parameters with the lowest absolute correlation values (|r| ≤ 0.5) were identified as having relatively weaker predictive power [28].
To systematically assess the impact of parameter selection, we established three distinct experimental groups:
Group a (Baseline): Full parameter set (Tem, Sal, DO, Chl-a, pH, Tur).
Group b (Low-correlation Excluded): Removed the parameter with the lowest correlation.
Group c (Control): Intentionally excluded a parameter with moderate correlation.
This tri-group experimental design enables rigorous validation of parameter selection effects through comparative analysis of model performance metrics (RMSE, MAE, R²). The control group serves as a critical benchmark to verify that the exclusion of low-correlation parameters is not coincidental but statistically justified.
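The three groups above can be assembled programmatically once the correlations with DO are known. The sketch below is illustrative only: the function name is hypothetical, the control parameter is passed in explicitly because the text describes it as an intentional choice, and the correlation values are placeholders apart from the Sal value, which matches the Pearson coefficient quoted in the discussion.

```python
def build_groups(correlations, control):
    """Return the predictor lists for groups a, b and c.

    correlations: mapping of predictor name to its correlation with DO.
    control: moderately correlated predictor to drop in group c.
    """
    predictors = list(correlations)
    weakest = min(predictors, key=lambda name: abs(correlations[name]))
    return {
        "a": predictors,                                    # full predictor set
        "b": [p for p in predictors if p != weakest],       # weakest removed
        "c": [p for p in predictors if p != control],       # control removed
    }

# Placeholder correlations with DO (illustrative, not the study's results).
correlations = {"Tem": -0.80, "Sal": 0.42, "pH": 0.70, "Chl-a": 0.60, "Tur": 0.10}
groups = build_groups(correlations, control="Sal")
```

Each group is then passed through the same PCA and model-training steps so that differences in RMSE, MAE, and R² can be attributed to the excluded parameter.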
4. Conclusions
4.1. Ecological Mechanism Analysis
Based on in situ monitoring data from the marine ranching construction project in Goji Island, East China Sea, six key parameters—Tem, Sal, DO, Chl-a, pH, and Tur—were selected for high-frequency monitoring campaigns. This selection aligns with ecological protection requirements, technical specifications, and cost-control considerations to assess ecosystem health. The ecological functions of these parameters are defined as follows: Tem regulates marine organisms’ metabolic rates and biogeographic distribution; Sal influences species composition and cellular activity via osmotic pressure; DO serves as the core indicator for aerobic respiration and energy metabolism; pH governs enzymatic activity and carbonate system equilibrium; Chl-a quantifies phytoplankton biomass and primary productivity; Tur reflects suspended particulate concentrations.
This study utilizes observational data obtained from marine aquaculture near Goji Island, specifically during the transitional period between late summer and early autumn (September to October). This timeframe is characterized by significant fluctuations in marine environmental factors, during which DO is synergistically regulated by Tem, Sal, pH, and Chl-a. Based on these factors, the ecological interpretation of each input parameter in the DO prediction model is as follows:
It is widely recognized that Tem exhibits a negative correlation with DO. Consistent with this observation, the partial dependence plot (PDP) analysis of the predictive model suggests that the relationship between Tem and DO strengthens as Tem decreases, so declining autumn Tem could become a predominant factor influencing DO.
An increase in Sal typically leads to a reduction in the saturation of DO in water. Additionally, Sal can indirectly reflect phenomena such as water mixing or stratification (e.g., freshwater from terrestrial sources or bottom water), which also affect the concentration of DO in the water body. In this study, we experimented with removing Sal as an input parameter, and the results indicated that the absence of Sal significantly impacts the accuracy of the outcomes. Therefore, Sal is a parameter that requires close attention.
The pH of seawater represents its acidity or alkalinity, which influences biological activities in aquatic systems. For instance, CO2 uptake during phytoplankton photosynthesis, CO2 release during respiration, and direct impacts on microbial activity can each link pH to DO, consistent with existing research findings. Therefore, monitoring seawater pH and DO is critical for assessing aquatic health and providing early warnings for hypoxia or acidification events, particularly in aquaculture and coral reef conservation.
As the season progresses, the decline in Chl-a concentration leads to a corresponding reduction in its contribution to the predicted DO. Chl-a serves as an indicator of phytoplankton abundance in seawater. During the summer months, when sunlight is abundant and Tem is elevated, phytoplankton photosynthesize, absorbing CO2 and generating O2, while simultaneously consuming oxygen through respiration. The analytical results of this study corroborate that the weight of Chl-a in the prediction of DO diminishes as its concentration decreases with the changing seasons.
Taking ecological factors into account, Tur in seawater primarily reflects the concentration of suspended particulate matter. The composition of suspended matter in seawater is complex, potentially encompassing both inorganic and organic substances. If the suspended matter is organic and biologically active, it may participate in photosynthesis or consume oxygen during decomposition processes, whereas inorganic matter may consist of nutrients or pollutants. Given that current monitoring methods are unable to precisely identify the predominant components of suspended matter, the mechanisms by which Tur influences DO concentrations remain unclear. Based on preliminary data analysis from the selected region and monitoring period of this study, it can only be inferred that the impact of Tur on DO is relatively minor. Nonetheless, in the practical computational process, the exclusion of Tur can reduce noise within the model and enhance the accuracy of predictions. Further analysis of Tur’s composition could potentially provide a more robust ecological explanation.
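The partial dependence analysis referred to in the Tem paragraph of this section can be sketched as follows: one input is fixed at each value of a grid, the model is evaluated on every record with that input overwritten, and the predictions are averaged. The code is plain Python for illustration; the toy model, its coefficients, and the sample rows are hypothetical stand-ins for the trained model and the monitoring data.

```python
def partial_dependence(model, rows, feature_index, grid):
    """Average model output over all rows with one feature fixed to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature_index] = value
            total += model(modified)
        curve.append(total / len(rows))
    return curve

def toy_model(row):
    """Hypothetical DO response: falls with Tem, more steeply at lower Tem."""
    tem, sal = row
    return 12.0 - 0.3 * tem + 0.004 * tem ** 2 - 0.05 * sal

# Hypothetical (Tem, Sal) records.
rows = [[24.0, 30.0], [25.0, 31.0], [26.0, 32.0]]
tem_grid = [20.0, 22.0, 24.0, 26.0]
tem_curve = partial_dependence(toy_model, rows, feature_index=0, grid=tem_grid)
```

Plotting `tem_curve` against `tem_grid` gives the partial dependence plot; the slope of the curve at different Tem values shows where the predicted DO responds most strongly.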
Sensor limitations precluded the integration of external drivers (light intensity, air pressure, wind speed) regulating photosynthesis, air–sea exchange, and surface mixing. While our model captures physics–chemistry–biology couplings, predictive stability under extreme meteorological conditions requires enhanced multi-source data fusion.
4.2. Discussion
This study establishes an integrated framework encompassing parameter selection, model optimization, and ecological analysis to elucidate key regulatory mechanisms in seawater DO prediction models and their ecological applications.
The comparative analysis of monitoring data from Goji Island marine ranching demonstrates that optimized parameter selection significantly enhances prediction accuracy. Correlation analysis (Pearson/Spearman coefficients) and PCA revealed minimal Tur-DO correlation (r < 0.15). Subsequent model comparisons (SVM, MLP, RF) showed that the PCA-RF model (excluding Tur) outperformed the others with RMSE = 0.039, MAE = 0.030, and R² = 0.884, improvements of 45.5%, 28.6%, and 3.3%, respectively, over the full-parameter model. Notably, despite the moderate Sal-DO correlation (Pearson r = 0.42), omitting Sal severely degraded performance (RMSE = 0.096, MAE = 0.073, R² = 0.251), confirming its critical physicochemical regulatory role.
(2) Parameter Importance Hierarchy
A multi-method assessment (Taylor diagrams, SHAP values, partial dependence analysis) established the parameter hierarchy:
Tem > pH > Sal > Chl-a > Tur
This hierarchy aligns with correlation and PCA results. Tem dominated DO variability, while Tur introduced model noise. The findings provide mechanistic insights into seasonal DO dynamics during summer–autumn transitions and guide the monitoring parameter selection for marine ranching.
(3) Ecological Implications
This study highlights two operational principles for regional DO prediction. First, prioritize the real-time monitoring of Tem, Sal, pH, and Chl-a, and exclude Tur to enhance model robustness. Second, because the parameter weights vary in space and time, seasonal adjustment mechanisms may further optimize predictive models. This parameter optimization strategy improves both the accuracy of the model and the ecological interpretability of DO dynamics.
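The SHAP values used to rank the parameters are based on Shapley values, which can be computed exactly when the number of inputs is small. The sketch below is a simplified illustration in plain Python, not the SHAP library and not the study’s trained model: features absent from a coalition are replaced by a single baseline value (an assumption), and the toy model, its coefficients, and the input values are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, instance, baseline):
    """Exact Shapley values of each feature for one prediction."""
    n = len(instance)

    def coalition_value(subset):
        # Features in the coalition take the instance value, others the baseline.
        mixed = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                gain = coalition_value(members | {i}) - coalition_value(members)
                phi[i] += weight * gain
    return phi

def toy_model(x):
    """Hypothetical linear DO response to (Tem, Sal, pH)."""
    tem, sal, ph = x
    return 10.0 - 0.2 * tem - 0.05 * sal + 1.0 * ph

names = ["Tem", "Sal", "pH"]
instance = [26.0, 30.0, 8.1]   # record being explained (illustrative)
baseline = [24.0, 29.0, 8.0]   # reference record (illustrative)
phi = shapley_values(toy_model, instance, baseline)
ranking = sorted(names, key=lambda name: -abs(phi[names.index(name)]))
```

The contributions sum to the difference between the prediction for the record and the prediction for the baseline, and averaging their absolute values over many records gives a global importance ranking of the kind reported above.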
The current research scope is constrained by the observational data derived from marine aquaculture monitoring systems, which may limit model generalizability and predictive accuracy. Specifically, the data set emphasizes locally measurable parameters (Tem, Sal, pH, Chl-a, Tur) while overlooking the external drivers of DO dynamics, such as meteorological and large-scale hydrological factors. This omission may reduce model applicability to marine ecosystems with distinct environmental regimes or intricate biogeochemical interactions.
To mitigate these limitations and strengthen framework robustness, future studies will extend monitoring protocols to encompass broader environmental variables. This expansion will incorporate remote sensing-derived parameters: wind speed (modulating air–sea gas exchange); atmospheric pressure (controlling interfacial oxygen flux); precipitation (modifying Sal gradients via freshwater influx); and photosynthetically active radiation (regulating phytoplankton-mediated primary production). Integrating these supplementary variables should enhance the model’s capacity to resolve the multifaceted drivers of DO variability, particularly in spatially heterogeneous and temporally dynamic marine ecosystems.