Road traffic accidents at intersections pose a persistent challenge in Riyadh, Saudi Arabia, contributing significantly to public health burdens and economic losses. Traditional statistical approaches often fail to capture the complex, non-linear interactions among geometric design, traffic parameters, and accident severity. This study develops a multi-methodological machine learning framework to predict intersection accident severity using the Equivalent Property Damage Only (EPDO) metric. Historical data (2017–2023) from Riyadh Municipality for 150 high-risk intersections were analyzed, incorporating predictors such as service road distance (SRD), U-turn distance (UTD), median width (MW), peak hour volume (PHV), heavy vehicle percentage (HV%), and injury/frequency counts. Six algorithms (Decision Tree, Random Forest, Gradient Boosting, Support Vector Machine, Linear Regression, and Artificial Neural Network) were compared using a 70/30 train–test split and k-fold cross-validation. The Gradient Boosting model achieved the best performance (R² = 0.89, MSE = 63.43, RMSE = 7.96) and was selected for final deployment. SHAP feature importance analysis revealed minor injuries (MIs), serious injuries (SRIs), and fatalities (FAs) as the dominant predictors, with geometric factors (UTD, MW) and traffic composition (HV%) providing actionable infrastructure insights. The model ranked intersections and identified “Jeddah Road with Taif Road” (predicted EPDO = 137.22) as the highest-risk location. Evidence-based recommendations include enforcing minimum 300 m U-turn buffers with staggered service road exits (≥150 m) and restricting heavy vehicles during peak hours. The scalable framework supports data-driven prioritization of safety interventions, aligns with sustainable urban mobility goals, and offers transferability to other metropolitan contexts worldwide.

(This article belongs to the Collection Accident Prevention and Risk Management for Safe and Sustainable Transportation)

31 pages, 4926 KB

Open Access Article

Interpretable Optimized Extreme Gradient Boosting for Prediction of Higher Heating Value from Elemental Composition of Coal Resource to Energy Conversion

by Paulino José García-Nieto, Esperanza García-Gonzalo, José Pablo Paredes-Sánchez and Luis Alfonso Menéndez-García

Big Data Cogn. Comput. 2026, 10(4), 112; https://doi.org/10.3390/bdcc10040112 - 7 Apr 2026

Abstract

The higher heating value (HHV), sometimes referred to as the gross calorific value, is a crucial metric for determining a fuel’s primary energy potential in energy production systems. In this study, a novel machine learning model combining extreme gradient boosting (XGBoost) with the differential evolution (DE) optimizer was developed to predict the HHV (dependent variable). The input variables were the constituents of the coal’s ultimate analysis: carbon (C), oxygen (O), hydrogen (H), nitrogen (N), and sulfur (S). For comparison, random forest regression (RFR), the M5 model tree, multivariate linear regression (MLR), and previously reported empirical correlations were also applied to the experimental dataset. The results showed that the XGBoost strategy produced the most accurate predictions. An initial XGBoost analysis was carried out to identify the relative contribution of the input variables to coal HHV prediction. In particular, for coal HHV estimates based on experimental samples, the XGBoost regression produced a correlation coefficient of 0.9858 and a coefficient of determination of 0.9691. The excellent agreement between observed and predicted values shows that the DE/XGBoost-based approximation performed satisfactorily. Lastly, a synopsis of the investigation’s key conclusions is provided.
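The abstract above reports both a correlation coefficient and a coefficient of determination. As a minimal sketch of why the two can differ, the snippet below computes Pearson's r and R² = 1 − SS_res/SS_tot for a small set of made-up values. The numbers and function names are illustrative assumptions, not data or code from the study.

```python
import math

def pearson_r(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)

def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_o = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical HHV values in MJ/kg (not from the paper).
observed = [20.0, 24.0, 28.0, 32.0]
# Predictions with a constant +1 offset: perfectly correlated but biased.
predicted = [21.0, 25.0, 29.0, 33.0]

r = pearson_r(observed, predicted)
r2 = r_squared(observed, predicted)
```

In this toy case the correlation is perfect while R² is lower, because R² penalizes the systematic offset that correlation ignores.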

(This article belongs to the Special Issue Smart Manufacturing in the AI Era)

15 pages, 1148 KB

Open Access Article

Early Prediction of Well-Being Outcomes in Older Adults Using Explainable AI and Emotional Intelligence Measures

by Evgenia Kouli, Evangelos Bebetsos, Maria Michalopoulou and Filippos Filippou

Appl. Sci. 2026, 16(7), 3586; https://doi.org/10.3390/app16073586 - 7 Apr 2026

Abstract

Background: Well-being in the elderly is shaped by complex emotional and social factors. Early identification of individuals at risk for reduced well-being may support timely preventive or supportive interventions. This study examined whether emotional intelligence indicators collected at baseline can predict well-being status 5 months later using explainable machine learning models. Methods: A cohort of elderly participants aged 60 to 89 years completed emotional intelligence measures at baseline, and well-being was assessed 5 months later using the POMS questionnaire. Four machine learning algorithms, Logistic Regression (LR), Support Vector Machines (SVM), Random Forest (RF), and Extreme Gradient Boosting (XGBoost), were developed using 5-fold stratified cross-validation. Model performance was evaluated through accuracy, precision, recall, F1-score, ROC AUC, and normalized confusion matrices. SHapley Additive exPlanations (SHAP) were applied to interpret the contribution and directionality of each predictor. Results: XGBoost achieved the highest predictive performance (accuracy = 0.789; F1 = 0.778) and demonstrated balanced classification across well-being categories. SVM also performed robustly (accuracy = 0.760), while LR showed reduced sensitivity for detecting those with poorer well-being. SHAP analysis identified self-control, emotionality, sociability, self-motivation, and well-being components as the most influential predictors. Lower emotionality, higher sociability, and higher self-control scores were linked to a greater probability of favorable well-being outcomes. Conclusions: The findings demonstrate the feasibility of using explainable machine learning models to predict 5-month well-being status within this sample of older adults using emotional intelligence indicators. XGBoost provided the strongest and most balanced performance, while SHAP analysis clarified how specific emotional intelligence dimensions influenced predictions. 
These findings suggest that interpretable machine learning approaches may support future efforts toward early recognition of older adults who may be at risk for reduced well-being and guide personalized intervention strategies.

(This article belongs to the Special Issue AI-Driven Innovations in Rehabilitation: Integrating Neurological, Musculoskeletal, Sports Medicine and Occupational Therapy Interventions)

16 pages, 4263 KB

Open Access Article

Application of Near-Infrared Spectroscopy in Moisture Detection of Carrot Slices During Freeze-Drying

by Pengtao Wang, Meng Sun, Hongwen Xu, Moran Zhang, Rong Liu, Yunfei Xie and Jun Cheng

Foods 2026, 15(7), 1256; https://doi.org/10.3390/foods15071256 - 7 Apr 2026

Abstract

This study explored the feasibility of near-infrared (NIR) spectroscopy for detecting total water, free water and bound water in carrot slices during freeze-drying, with low-field nuclear magnetic resonance (LF-NMR) characterizing water state distribution and oven-drying determining moisture content (MC). NIR spectra (10,000–4000 cm⁻¹) were processed via optimized sample partitioning, preprocessing and feature extraction; partial least squares regression (PLSR), support vector regression (SVR), back-propagation artificial neural network (BPANN), extreme gradient boosting (XGBoost) and particle swarm optimization–random forest (PSO-RF) models were established and evaluated. Results showed that SVR and BPANN performed robustly, with CARS being the optimal feature extraction method. The full-moisture system achieved high total/free water prediction accuracy (Rₚ² = 0.9902/0.9740), while the low-moisture system improved bound water prediction (Rₚ² = 0.9709). The established NIR models exhibited excellent fitting and generalization ability, enabling rapid and non-destructive quantitative prediction of moisture content during carrot freeze-drying.

(This article belongs to the Section Food Analytical Methods)

22 pages, 3050 KB

Open Access Article

Event-Based Dual-Task Forecasting for SLA-Oriented Hospital Transport Operations Using Machine and Deep Learning Models

by Murat Akın

Appl. Sci. 2026, 16(7), 3570; https://doi.org/10.3390/app16073570 - 6 Apr 2026

Abstract

Service Level Agreement (SLA) compliance in hospital transport processes is essential for patient safety, service continuity, and resource efficiency. However, transport requests occur as irregular events, limiting the applicability of equally spaced time-series assumptions. This study jointly addresses two complementary objectives in an event-based framework: predicting the interarrival time between consecutive transport requests (next-event forecasting) and forecasting the total request count within forward SLA horizons (forward-count forecasting). Machine learning methods such as Ridge Regression, Extra Trees, and Histogram-based Gradient Boosting, as well as deep learning architectures such as Long Short-Term Memory and Gated Recurrent Unit, were compared under different time horizons and adaptive history windows on time-stamped transport request records from the operational system supporting a private hospital in Turkey, including patient, specimen, and material transport requests. Results indicate that deep learning methods yield lower errors in demand count prediction at short time horizons; as the horizon lengthens, machine learning methods perform similarly and in some cases outperform them; and as the history window increases, the prediction error for the next request occurrence systematically decreases. The lowest mean absolute error values in request counts were obtained for demand forecasting within a 30 min time window: 2.10 for material transport, 3.88 for patient transport, and 2.84 for specimen transport. Additionally, the R² value reached 0.98 for next-event forecasting with a rolling-memory window of 20 events. Overall, the findings suggest that hospital transport demand is substantially predictable and that event-based forecasting can support SLA-oriented staffing, task dispatching, and delay mitigation.
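The two forecasting targets described above can be derived directly from irregular event timestamps. The following is a minimal sketch, assuming timestamps are sorted and expressed in minutes. The function names, the example timestamps, and the window sizes are hypothetical and are not taken from the study.

```python
from bisect import bisect_right

def interarrival_times(timestamps):
    """Gaps between consecutive events (next-event forecasting target)."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]

def forward_counts(timestamps, horizon):
    """For each event at time t, count later events in (t, t + horizon]."""
    counts = []
    for t in timestamps:
        start = bisect_right(timestamps, t)          # first event after t
        end = bisect_right(timestamps, t + horizon)  # first event after t + horizon
        counts.append(end - start)
    return counts

def rolling_history(gaps, window):
    """Pair the last `window` gaps with the next gap to be predicted."""
    return [(gaps[i - window:i], gaps[i]) for i in range(window, len(gaps))]

# Hypothetical request times in minutes since the start of a shift.
requests = [0, 4, 9, 30, 33, 41, 70]

gaps = interarrival_times(requests)
counts_30 = forward_counts(requests, horizon=30)
samples = rolling_history(gaps, window=3)
```

Each `(history, next_gap)` pair in `samples` is one training example for a next-event model, while `counts_30` provides one forward-count label per event for a 30 min horizon.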

19 pages, 6202 KB

Open Access Article

Yield Prediction in Winter Oilseed Rape Based on Multi-Temporal NDVI and Modelling Approaches

by Edyta Okupska, Antanas Juostas, Dariusz Gozdowski and Elżbieta Wójcik-Gront

Agronomy 2026, 16(7), 763; https://doi.org/10.3390/agronomy16070763 - 5 Apr 2026

Viewed by 107

Abstract

Accurate prediction of winter oilseed rape yield is essential for optimising crop management and improving production efficiency. However, the reliability of commonly reported model performance remains uncertain due to the widespread use of random validation strategies. This study evaluated the predictive potential of multi-temporal Normalised Difference Vegetation Index (NDVI) metrics collected between September 2023 and May 2024 for yield estimation across multiple Lithuanian fields, while explicitly addressing spatial generalisation. The analytical dataset comprised dry yield (t ha⁻¹), monthly NDVI, and field identifiers, and underwent quality control, including outlier removal. Four modelling approaches were compared: ordinary least squares (OLS) regression, Random Forest (RF), Extreme Gradient Boosting (XGBoost), and a Deep Neural Network (DNN). Model performance was assessed using both a random (80/20) and a spatially independent field-wise (GroupSplit) validation scheme. The field-wise scheme was designed to assess model transferability to previously unseen fields and was further extended by repeated group-based resampling to quantify variability in model generalisation. Under random sampling, RF and XGBoost achieved the highest accuracy (RMSE ≈ 0.85 t ha⁻¹, R² ≈ 0.55). However, under spatially independent validation, predictive performance declined markedly for all models, with tree-based ensembles showing near-zero R² values, indicating limited transferability to unseen fields. In contrast, the DNN demonstrated more consistent generalisation (RMSE = 1.09 t ha⁻¹, R² = 0.28). Repeated field-wise validation confirmed that performance estimates based on random splits substantially overestimate true predictive capability. Feature importance analyses consistently identified spring NDVI, particularly from March to May, as the dominant predictor of yield, whereas autumn NDVI showed weaker and less consistent relationships with yield.
These findings demonstrate that a large portion of the predictive skill reported in NDVI-based yield modelling may arise from spatial information leakage rather than transferable crop-environment relationships. By explicitly quantifying the gap between random and spatial validation, this study provides a more realistic benchmark for model performance and highlights the necessity of spatially robust evaluation frameworks for operational yield prediction in precision agriculture.
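The difference between a random split and a field-wise split can be illustrated with a short sketch. The code below holds out whole groups so that no field contributes samples to both the training and the test set. It is an illustrative stand-in written with the standard library only, not the authors' procedure, and the field labels, function name, and parameters are hypothetical.

```python
import random

def group_split(groups, test_fraction=0.2, seed=0):
    """Hold out whole groups (fields) so no field appears in both sets."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # At least one group is always held out for testing.
    n_test = max(1, round(test_fraction * len(unique)))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

# Hypothetical field identifier for each yield sample.
fields = ["A", "A", "B", "B", "C", "C", "D", "D", "E", "E"]

train_idx, test_idx = group_split(fields, test_fraction=0.2, seed=42)
train_fields = {fields[i] for i in train_idx}
test_fields = {fields[i] for i in test_idx}
```

Repeating the split with different seeds gives a distribution of scores, which is one way to approximate the repeated group-based resampling mentioned in the abstract. Libraries such as scikit-learn provide comparable utilities (for example `GroupShuffleSplit` and `GroupKFold`).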

(This article belongs to the Special Issue Agricultural Monitoring and Yield Assessment Through Remote Sensing and GIS)

16 pages, 5649 KB

Open Access Article

Improving Probabilistic Lightning Forecasts Through Ensemble Postprocessing with Mesoscale Information

by Haoyue Li, Ziqiang Huo and Jialing Wang

Atmosphere 2026, 17(4), 371; https://doi.org/10.3390/atmos17040371 - 3 Apr 2026

Viewed by 188

Abstract

Accurate short-term lightning forecasting requires reliable representations of both lightning occurrence and intensity, as well as the underlying convective processes. While ensemble prediction systems (EPSs) provide valuable probabilistic information, their ability to resolve mesoscale and convective-scale variability remains limited. In this study, we assess the added value of mesoscale information for probabilistic lightning forecasting over eastern China. A mesoscale ensemble is constructed from deterministic forecasts of the China Meteorological Administration (CMA) Mesoscale Model (MESO) using spatiotemporal neighborhood and time-lagged techniques and is combined with predictors from the CMA Regional Ensemble Prediction System (REPS). Lightning occurrence and counts are modeled within a Bayesian additive model for location, scale, and shape (BAMLSS) framework, using a hurdle-based count regression to account for excess zeros and overdispersion. Influential nonlinear predictors are selected via stability selection combined with gradient boosting. Forecast performance with and without MESO-derived predictors is systematically evaluated. The results indicate that incorporating mesoscale information generally improves forecast skill for both lightning occurrence and intensity across multiple verification metrics. These improvements are associated with MESO-derived predictors related to convective available potential energy and convective precipitation, suggesting the importance of mesoscale processes for probabilistic lightning forecasting.
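A hurdle model separates the question of whether any lightning occurs from the question of how many flashes occur once it does. The sketch below shows the probability mass function of a hurdle model with a zero-truncated Poisson count part. The Poisson choice is a simplifying assumption made here for illustration: the abstract does not name the count distribution, and a model that handles overdispersion would typically use a more flexible one. Function and parameter names are hypothetical.

```python
import math

def hurdle_poisson_pmf(k, p_occurrence, rate):
    """P(Y = k) for a hurdle model with a zero-truncated Poisson count part.

    p_occurrence: probability that at least one event occurs (the hurdle).
    rate: Poisson rate of the count part before truncation at zero.
    """
    if k == 0:
        return 1.0 - p_occurrence
    poisson = math.exp(-rate) * rate ** k / math.factorial(k)
    # Renormalize the Poisson mass over k >= 1.
    return p_occurrence * poisson / (1.0 - math.exp(-rate))

# Hypothetical parameters: lightning occurs in 10% of cases.
p_zero = hurdle_poisson_pmf(0, p_occurrence=0.1, rate=3.0)
total = sum(hurdle_poisson_pmf(k, 0.1, 3.0) for k in range(0, 61))
```

Because the zero probability is set by `p_occurrence` alone, the model can represent many more zeros than a plain Poisson distribution with the same rate would allow.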

(This article belongs to the Special Issue Numerical Weather Prediction Models and Ensemble Prediction Systems (2nd Edition))

23 pages, 4047 KB

Open Access Article

UAV-Based Estimation of Tea Leaf Area Index in Mountainous Terrain: Integrating Topographic Correction and Interpretable Machine Learning

by Na Lin, Jian Zhao, Huxiang Shao, Miaomiao Wang and Hong Chen

Sensors 2026, 26(7), 2218; https://doi.org/10.3390/s26072218 - 3 Apr 2026

Viewed by 194

Abstract

Leaf Area Index (LAI) is a fundamental parameter for characterizing the growth of tea (Camellia sinensis L.). However, in rugged mountainous regions, the combined effects of topographic relief and canopy structural heterogeneity severely constrain the accuracy of UAV-based multispectral LAI retrieval. This study develops an integrated framework combining topographic correction with interpretable machine learning to improve LAI estimation. We utilized a UAV multispectral dataset collected during the peak growing season from a typical tea-growing region in Fujian Province, China (altitude range: 58–186 m), comprising a total of 90 samples. Three topographic correction methods, including Sun–Canopy–Sensor (SCS), SCS with C correction (SCS+C), and Minnaert+SCS, were evaluated in combination with Linear Regression (LR), Decision Tree (DT), Random Forest (RF), and Extreme Gradient Boosting (XGBoost) models. Results indicated that the SCS+C algorithm outperformed other methods by effectively accounting for direct and diffuse radiation components, thereby reducing topographic dependence while maintaining radiometric consistency across heterogeneous surfaces. The XGBoost model combined with SCS+C correction achieved the highest performance (R² = 0.8930, RMSE = 0.6676, nRMSE = 7.93%, MAE = 0.4936, Bias = −0.0836). SHapley Additive exPlanations (SHAP) analysis revealed a structure-dominated retrieval mechanism, in which red-band textural features (Correlation_R) exhibited higher importance than conventional vegetation indices. Compared with previous studies that primarily focus on either topographic correction or model development, this study provides quantitative insights into the underlying retrieval mechanisms. This framework improves the precision of tea LAI retrieval in complex terrains and provides a robust methodological basis for digital management in mountainous agriculture.

(This article belongs to the Special Issue AI UAV-Based Systems for Agricultural Monitoring)

15 pages, 2161 KB

Open Access Article

Estimation of Exhaust Gas Concentrations from a Diesel Engine Powered by Diesel Fuel and Rapeseed Oil Operating Under Dynamic Conditions Using Machine Learning

by Michał Kuszneruk, Rafał Longwic, Krzysztof Górski and Dimitrios Tziourtzioumis

Energies 2026, 19(7), 1750; https://doi.org/10.3390/en19071750 - 2 Apr 2026

Viewed by 254

Abstract

This paper analyses the exhaust gas concentrations of a compression ignition engine powered by diesel fuel and rapeseed oil under dynamic conditions. The measurement cycle consisted of a 100 s segment of the WLTC cycle. Exhaust gas concentrations were then estimated using predictive algorithms based on parameters recorded via the OBD-II diagnostic interface. The model was validated on previously unseen measurements of the measurement cycle, and the procedure was repeated several times with random parameter changes. Because the combustion process is dynamic (non-linear and subject to inertia), a delayed (lagged) feature design was used. A consistent time horizon of input information was selected for the tabular and sequential models. The results indicated that Gradient-Boosted Regression Trees algorithms achieved the highest quality of fit and the greatest stability.
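A delayed feature design of the kind mentioned above can be sketched as follows: each training row combines the current value of every recorded signal with its values at the preceding time steps, so that a tabular model can see recent history. The signal names, values, and lag count below are hypothetical and are not taken from the paper.

```python
def build_lagged_rows(signals, target, n_lags):
    """Build (features, target) rows using each signal at t, t-1, ..., t-n_lags."""
    names = sorted(signals)  # fixed column order
    rows = []
    for t in range(n_lags, len(target)):
        features = []
        for name in names:
            # Current value first, then progressively older values.
            features.extend(signals[name][t - lag] for lag in range(n_lags + 1))
        rows.append((features, target[t]))
    return rows

# Hypothetical OBD-II style signals sampled once per second.
signals = {
    "engine_speed": [800, 900, 1200, 1500, 1400],
    "load": [10, 20, 45, 60, 50],
}
# Hypothetical measured exhaust gas concentration at the same time steps.
concentration = [50, 60, 120, 180, 160]

rows = build_lagged_rows(signals, concentration, n_lags=2)
```

With `n_lags=2`, the first two time steps are dropped because they lack a full history, and every remaining row covers the same three-step input horizon. A sequential model given a three-step window would see the same information, which is one way to keep the time horizon consistent across model types.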

(This article belongs to the Section I2: Energy and Combustion Science)

23 pages, 8508 KB

Open Access Article

Research on the Influence Mechanism of Urban Morphology Indicators on the Diurnal and Seasonal Surface Temperature

by Ruixi Liu, Xianglong Kong, Yutong Wu, Peng Cui and Guangpu Wei

Land 2026, 15(4), 585; https://doi.org/10.3390/land15040585 - 1 Apr 2026

Viewed by 370

Abstract

Urban morphology influences the distribution and variation of land surface temperature (LST) by altering surface cover type. However, the coupling effects of the daily LST cycle and the multidimensional morphological driving mechanisms remain insufficiently explored in existing studies. This study, based on ECOSTRESS diurnal LST data, focuses on Harbin, a representative city in China’s cold climate regions. By integrating land cover data, urban morphology vector data, and interpretable machine learning models, it investigates the relationship between urban morphology indicators and LST over a 24 h cycle under cold climate conditions. LST prediction was carried out using Gradient Boosting Decision Trees (XGBoost), Random Forests (RF), Support Vector Machines (SVM), and Multiple Linear Regression (MLR), and the prediction accuracy of each model was evaluated. The findings indicated the following: (1) The influence of 2D and 3D urban morphological indicators on LST exhibits significant seasonal variation, with Building Otherness (BO), Mean Building Height (BH), and the Normalized Difference Vegetation Index (NDVI) exerting notable impacts on LST in both winter and summer. (2) Significant interactions exist between certain urban morphological indicators that can effectively reduce LST: when the Patch Land Area Proportion (PLAND) exceeds 40%, increasing the Largest Patch Index (LPI) contributes to lowering LST in summer. (3) Among the evaluated machine learning algorithms, XGBoost demonstrates the highest prediction accuracy. This study provides scientific insights for urban planning and policy development, aiding in the optimization of urban morphological designs to effectively regulate LST.

(This article belongs to the Section Land – Observation and Monitoring)

16 pages, 1045 KB

Open Access Article

Risk Level Assessment and Impact Range Analysis of CCUS CO₂ Pipeline Leakage Based on Machine Learning

by Haoyuan Zhang, Siqi Wang, Xiaoping Jia and Fang Wang

Safety 2026, 12(2), 44; https://doi.org/10.3390/safety12020044 - 31 Mar 2026

Viewed by 151

Abstract

In emergency decision-making for carbon capture, utilization, and storage (CCUS) CO₂ pipeline leakage, risk levels and warning distances/impact ranges are often derived from different methodological systems—risk-matrix scoring versus mechanistic consequence modeling. Differences in threshold definitions and modeling assumptions make it difficult to align level assignment with distance boundaries for the same scenario, which in turn reduces the comparability and traceability of multi-scenario batch screening. To address this, this study proposes an integrated framework based on “threshold impact-distance calculation–risk-matrix mapping,” with physical consequence quantification as the main thread. A scenario library (N = 4320) covering phase state, leak aperture, operating conditions, and meteorological fields is constructed; impact distances corresponding to CO₂ volume-fraction thresholds of 1%/4%/10% (R_1%, R_4%, R_10%) are computed and then mapped to five RiskLevel classes under a unified rule set, enabling standardized synchronous outputs. The modeling tasks are formulated as RiskLevel classification and threshold-distance regression. Using a stratified 70%/30% train–test split, Extreme Gradient Boosting (XGBoost) is adopted as the primary model and compared with logistic regression (LR), support vector classification (SVC), ordinary least squares regression (OLS), and support vector regression (SVR). Results show that XGBoost achieves an accuracy of 0.806 and a macro-F1 of 0.825 for RiskLevel classification, with a recall of 0.631 for the high-risk classes (RiskLevel 4–5), and yields mean absolute errors (MAEs) of 95/62/41 m for R_1%/R_4%/R_10% regression with coefficient of determination (R²) values of 0.795–0.814. Distributional analysis further indicates that threshold impact distances increase overall with higher RiskLevel, while dispersion becomes larger at higher levels. 
Accordingly, a parallel representation of “RiskLevel + multi-threshold rings” is recommended to support coordinated graded control and zoned warning delineation.

23 pages, 6865 KB

Open Access Article

Integrating Hyperspectral Data and Deep Learning for Non-Destructive Prediction of Tea Quality Parameters Across Different Physical States of Tea Leaves and Growth Periods

by Guanzi Zhou, Haotian Ji, Rongyu Pan, Xiaowei Yang, Suhui Zhao, Lei Yang, Xiaohan Shang, Huijie Zhang, Hanchi Zhang, Xiaojun Liu, Yuanchun Ma, Xujun Zhu, Jie Jiang and Wanping Fang

Plants 2026, 15(7), 1071; https://doi.org/10.3390/plants15071071 - 31 Mar 2026

Viewed by 302

Abstract

Achieving rapid and non-destructive assessment of tea quality is essential for intelligent tea production and quality control. In this study, an integrated hyperspectral and deep learning framework was developed to estimate tea quality constituents across seasons and physical states. Samples, including fresh field leaves, dried tea leaves, and tea powder, were collected in spring, summer, and autumn. Tea polyphenols and catechins were predicted using original reflectance, harmonic features, and wavelet features fused into multi-domain indices. Extreme gradient boosting, Gaussian process regression, and convolutional neural networks (CNN) were systematically compared to construct the quality estimation models. The results showed that three-feature indices consistently outperformed two-feature indices, yielding R² values from 0.48 to 0.71. CNN achieved the best overall performance among the three modeling approaches, with its optimal accuracy obtained for tea powder samples in autumn, yielding R² values of 0.81 and 0.76 for tea polyphenols and catechins, respectively. This framework provides an accurate, non-destructive tool for tea quality evaluation and traceability, offering technical support for intelligent agriculture and quality control across the tea industry chain.

(This article belongs to the Special Issue Machine Learning for Plant Phenotyping in Crops)

7 pages, 904 KB

Open Access Proceeding Paper

Predictive Modeling of Malaria Risk Using the Nigerian Demographic and Health Survey Data

by JohnPaul C. Ugwu, Thecla O. Ayoka, Charles O. Nnadi and Wilfred O. Obonga

Eng. Proc. 2026, 124(1), 98; https://doi.org/10.3390/engproc2026124098 - 31 Mar 2026

Viewed by 210

Abstract

Malaria continues to pose a significant public health challenge in Nigeria, yet little research has used machine-learning techniques to forecast malaria risk. This study developed a machine-learning model that predicts malaria risk by leveraging demographic, environmental, and GPS data from the Nigerian Demographic and Health Survey (DHS) covering the years 2000 to 2020. The dataset was pre-processed and split into a training set (406 respondents) and a test set (102 respondents). Random Forest (RF), Gradient Boosting (GB) and Linear Regression (LR) algorithms were employed, and their predictive performance was assessed. RF achieved the best accuracy, with the lowest mean squared error (MSE = 0.0053) and the highest coefficient of determination (R² = 0.6364), and was therefore recognized as the most effective model for predicting malaria risk. In the regression equation, positive coefficients (e.g., population density = 0.0141, travel time = 0.0019, minimum temperature = 0.0082, temperature in January = 0.0265, and dry land surface temperature = 0.0368) indicate that higher feature values are associated with increased malaria prevalence, while negative coefficients (e.g., rainfall = −0.0122, nightlights composite = −0.03, potential evapotranspiration = −0.09, and insecticide-treated nets = −0.02) suggest that prevalence decreases as the feature increases. This study underscores the potential of the RF approach to improve early prediction of malaria risk and to guide targeted interventions for malaria control in high-risk areas.

(This article belongs to the Proceedings of The 6th International Electronic Conference on Applied Sciences)

30 pages, 4624 KB

Open Access Article

Distribution Characteristics and Hazard Assessment of Ground Collapse in the Mining Activity Areas of the Turpan–Hami Basin

by Tao Wang, Chao Jin, Ning Liang, Yongchao Li, Shuaihua Song, Jingjing Ying, Yiqing Zhao and Bowen Zheng

Appl. Sci. 2026, 16(7), 3354; https://doi.org/10.3390/app16073354 - 30 Mar 2026

Viewed by 312

Abstract

The Turpan–Hami Basin, a critical energy hub in northwestern China, is plagued by frequent ground collapses induced by extensive mining over karst geology, threatening ecology and safety. Current hazard assessment methods, mainly single linear or traditional machine learning models, fail to capture the complex nonlinear interactions inherent to this coupled geo-mining environment. This study addresses this gap by establishing a multi-dimensional “Geology-Mining-Hydrology-Environment” index system comprising 14 critical factors (including lithology, goaf distribution, mining intensity, and their interaction terms). A coupled gradient boosting decision tree and logistic regression (GBDT-LR) model, optimized for the multi-factor coupling characteristics of ground collapse in arid mining basins, was applied for the hazard assessment. The results reveal a distinct spatial pattern of “core agglomeration with multi-level gradient differentiation.” Extremely high-hazard areas, covering 9.21% of the area, are concentrated in the core mining areas northwest of Turpan and southwest of Hami, while high-hazard areas (4.63%) form surrounding belts. The GBDT-LR model (AUC = 0.871) demonstrated significantly superior performance over a single logistic regression model (AUC = 0.813), proving its enhanced capability to identify high-hazard areas by modeling complex factor interactions. This work provides an essential scientific foundation for implementing zonal hazard management and prioritizing disaster prevention projects in key areas of the basin.

(This article belongs to the Special Issue Remote Sensing Technology in Landslide and Land Subsidence—2nd Edition)

20 pages, 3507 KB

Open Access Article

Optimizing Data Preprocessing and Hyperparameter Tuning for Soil Organic Carbon Content Prediction Using Large Language Models: A Case Study of the Black Soil and Windblown Sandy Soil Regions in Northeast China

by Hao Cui, Xianmin Chang and Shuang Gang

Appl. Sci. 2026, 16(7), 3349; https://doi.org/10.3390/app16073349 - 30 Mar 2026

Viewed by 201

Abstract

Soil organic carbon (SOC) content prediction currently faces two problems. Data preprocessing relies on expert experience to formulate fixed rules, which leads to a lack of uniform standards and insufficient consideration of regional soil heterogeneity, while hyperparameter tuning suffers from high computational costs and excessively long runtimes. To address these issues, this study proposes an intelligent modeling workflow driven by Large Language Models (LLMs). The workflow focuses on optimizing two key aspects of SOC Random Forest modeling: data preprocessing and hyperparameter tuning. The LLM-defined rules achieved sample retention rates of 55.33% and 61.90% in the two regions, respectively, a larger between-region difference than that of traditional hard-coded rules (56.2% and 59.3%), and the mean SOC content deviations (30.27% and 20.05%) were both lower than those of traditional hard-coding. At the same time, the mean SOC content values in both regions were close to those obtained with other methods, indicating that the LLM effectively captured regional soil differences. With only a single evaluation of hyperparameter optimization, the adaptive model achieved test set R² values of 0.394 and 0.694 in the black soil region and the windblown sandy soil region, respectively, with root mean square error (RMSE) values of 8.76 g/kg and 6.07 g/kg. Its performance is comparable to that of Grid Search and Random Search, while computational efficiency improved by over 95%. Performance comparisons with eXtreme Gradient Boosting (XGBoost) and Partial Least Squares Regression (PLSR) indicate that the LLM-optimized Random Forest, at these accuracy levels, has practical application value.

(This article belongs to the Section Environmental Sciences)
