1. Introduction
Soil organic carbon (SOC) is a key component of the global carbon cycle, playing an important role in mitigating climate change, improving soil health, and enhancing agricultural productivity. Quantifying and monitoring SOC content is essential for evaluating soil quality, guiding sustainable land management practices, and meeting international climate change mitigation commitments [1]. Consequently, SOC mapping has garnered global interest as a means of addressing environmental and food security challenges. Interest in SOC mapping has been particularly pronounced in Africa [2,3,4], which faces a unique combination of challenges and opportunities in soil management. The diverse climates and ecosystems of Africa produce a varied soil landscape, where accurate SOC mapping can help improve agricultural resilience, food security, and climate change adaptation efforts [5]. Furthermore, digital mapping of SOC in a sub-Saharan country like Senegal can contribute to achieving several Sustainable Development Goals (SDGs).
In this context, the integration of machine learning (ML) algorithms with Earth observation (EO) data has been recognized as a powerful approach for improving the accuracy and efficiency of SOC prediction and mapping [6,7]. According to Nenkam Mentho et al. [8], among 110 studies conducted in Africa, 34 and 6 specifically focused on SOC and soil organic matter (SOM), respectively, both with and without the consideration of other soil attributes. For instance, Hengl et al. [9] demonstrated the utility of the Africa Soil Information Service (AfSIS) in conjunction with Moderate Resolution Imaging Spectroradiometer (MODIS) data for the mapping of various soil properties, including SOC and pH, at a resolution of 250 m. Utilizing the same data source, Vågen et al. [5] employed a Random Forest model for SOC mapping across the African continent. Furthermore, Hengl et al. [10] generated 30 m resolution pan-African maps detailing various soil nutrients, such as SOC, pH, total nitrogen (N), phosphorus (P), and potassium (K), among others, through the combination of diverse EO datasets and ensemble ML algorithms. Bouasria et al. [11] explored the feasibility of utilizing pan-sharpened Landsat-8 imagery (15 m resolution) for SOM mapping via multiple linear regression and artificial neural networks. Similarly, Bouslihim et al. [12] employed a Random Forest approach for SOM mapping using Landsat-8 imagery at a 30 m resolution.
Recent advances in remote sensing technologies have expanded the opportunities for digital soil mapping (DSM). The Sentinel-1 (C-band synthetic aperture radar) and Sentinel-2 (multi-spectral optical data) satellites provide unprecedented opportunities for detailed and frequent monitoring of the Earth's surface, including soil properties. While Sentinel-2 provides high-resolution optical images useful for capturing surface features and vegetation indices, Sentinel-1 radar data offer the advantages of penetrating cloud cover and providing information on soil moisture, which is closely linked to SOC content [13,14]. Within the African context, out of 110 studies, 11 have utilized Sentinel-2 data for DSM purposes, yet only 2 have yielded SOC maps at a 10 m resolution [8]. In the first, Mponela et al. [15] used Sentinel-2 data to determine soil fertility (including SOC, NPK, etc.) for a 0.45 ha area in Malawi. In the second, Flynn et al. [16] predicted soil particle size distribution and SOC content at a 10 m resolution over a 366 ha area in South Africa. Despite this potential, the application of Sentinel data for SOC mapping in Africa remains underexploited. Globally, most studies have used Sentinel data from a single date [17,18,19,20,21], and only a limited number of investigations have exploited multi-temporal data from Sentinel-1 or Sentinel-2 [22,23,24].
This study investigates several hypotheses related to DSM for SOC prediction. Firstly, we hypothesized that the combined use of multi-temporal Sentinel-1 and Sentinel-2 data would outperform the individual use of either data source in predicting SOC content. Secondly, we posited that incorporating topographic features as auxiliary environmental variables would further enhance the accuracy of SOC prediction models. Finally, we anticipated that different machine learning algorithms (RF, SVR, and XGBoost) would exhibit varying performance levels depending on the specific combination of input variables and the chosen scenario. To test these hypotheses, we evaluated the efficacy of these data sources and algorithms across various scenarios, aiming to identify the optimal approach for generating high-resolution SOC maps. This research contributes insights into the synergistic potential of Sentinel data and the role of environmental variables and machine learning in advancing digital soil mapping techniques for SOC prediction. In addition, this paper supports SDG 13 (Climate Action) by providing crucial data for understanding and monitoring carbon sequestration capacities, thus informing climate change mitigation strategies, and SDG 15 (Life on Land) through its potential to improve soil health, promote sustainable land use practices, and combat desertification, which is particularly important in arid and semi-arid regions. Moreover, by enabling better-informed agricultural practices, this research indirectly contributes to SDG 2 (Zero Hunger) and SDG 1 (No Poverty) by improving food security and livelihoods through improved soil fertility and crop yields. Thus, digital mapping of soil organic carbon serves as a multi-disciplinary tool that cuts across various environmental and socio-economic aspects of sustainable development in the context of African countries.
4. Discussion
To thoroughly discuss the findings of this study, three main aspects were considered: (i) feature importance in SOC prediction, (ii) the performance of the various scenarios using Sentinel-1 and Sentinel-2 and topographic data, and (iii) the effectiveness and comparative analysis of the three ML algorithms.
Firstly, the RFE method was used to select the most important variables/features for SOC prediction. In total, 10 variables were identified for Scenarios 1 and 2, 20 variables for Scenarios 3 and 5, and 7 variables for Scenario 4. The number of variables for Scenarios 3 and 5 was increased to assess whether the RFE model would extract identical variables from Sentinel-1, Sentinel-2, and topographic data, or if one dataset would predominate over the others. The indices identified as being significant were MNDWI, SAVI, and MTCI, each contributing more than three variables from different months, indicating their relevance over different time periods. The importance of these variables is explained by the fact that SAVI and MTCI reflect vegetation [47,48], which is indirectly correlated with soil health and fertility [5,49] and consequently serves as a proxy for soil organic matter content [50]. This association has been supported by numerous studies that have used vegetation indices, such as SAVI, NDVI, and others, to predict SOC or SOM [51,52,53,54,55,56]. The link between MNDWI and SOC is more indirect and complex. SOC affects soil physical and chemical properties, including color, texture, and moisture retention capacity. These properties can influence soil reflectance characteristics in different spectral bands, including the green and SWIR bands, and may indirectly highlight the importance of soil moisture parameters in SOC prediction [57,58,59], as moisture-rich environments can facilitate the preservation and accumulation of organic carbon in soil [1,60,61]. Furthermore, our results align with those of Lu et al. [62], who highlighted the importance of MNDWI alongside other soil moisture indices such as the Topographic Wetness Index (TWI) for SOC prediction. CI and BI showed a significant contribution to SOC prediction due to their ability to capture variations in soil color, which are often indicative of SOM content and other soil properties [63,64]. The correlation between SOC and CI and BI was already highlighted in previous studies, such as Saha et al. [65], which demonstrated that different spectral color indices, especially CI, are important for SOC prediction and mapping.
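The selection step described above can be illustrated with a minimal sketch of recursive feature elimination using scikit-learn. The covariate table, the feature names, the Random Forest base estimator, and the elimination step size are all assumptions made for illustration; they are not the data or the exact configuration used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

rng = np.random.default_rng(42)

# Synthetic stand-in for a covariate table (hypothetical, not the study data):
# 120 sampling points x 30 candidate covariates (e.g. monthly indices).
n_samples, n_features = 120, 30
X = rng.normal(size=(n_samples, n_features))
feature_names = [f"covariate_{i:02d}" for i in range(n_features)]

# Hypothetical target: depends on the first three covariates plus noise.
y = (0.5 * X[:, 0] - 0.3 * X[:, 1] + 0.2 * X[:, 2]
     + rng.normal(scale=0.1, size=n_samples))

# RFE repeatedly fits the estimator and drops the least important features
# (here two per iteration) until the requested number remains.
estimator = RandomForestRegressor(n_estimators=100, random_state=0)
selector = RFE(estimator, n_features_to_select=10, step=2)
selector.fit(X, y)

# support_ is a boolean mask of retained features; ranking_ is 1 for retained
# features and increases for features eliminated earlier.
selected = [name for name, keep in zip(feature_names, selector.support_) if keep]
print(selected)
```

Changing `n_features_to_select` to 20 or 7 would mirror the different variable counts reported for the other scenarios, under the same assumptions.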
The Sentinel-2-derived indices used in Scenario 2 contributed more significantly than the Sentinel-1 dual-polarization indices (VV and VH). This can be attributed to the superior ability of Sentinel-2 variables to predict SOC compared with Sentinel-1, which is reflected in the performance differences between the models. In detail, Scenario 2 showed higher performance for RF (R² = 0.49, RMSE = 0.037%) and XGBoost (R² = 0.45, RMSE = 0.039%) compared to Scenario 1, for which the RF performance was R² = 0.36 and RMSE = 0.042% and the XGBoost performance was R² = 0.34 and RMSE = 0.046%. In addition, the combination of the two scenarios resulted in an even higher performance for RF (R² = 0.61, RMSE = 0.024%) and XGBoost (R² = 0.51, RMSE = 0.028%), with a significant contribution from Sentinel-2 variables. This advantage of Sentinel-2 has been confirmed by various studies, such as Nguyen et al. [54], who found that SOC prediction performance using Sentinel-2 was superior to that using Sentinel-1, with R² values of 0.44 versus 0.25. Zhang et al. [66] obtained similar results, with an R² of 0.47 for Sentinel-2 versus 0.26 for Sentinel-1. In addition, Fatholoumi et al. [67] and Wang and Zhou [68] pointed out that, compared to using data from a single date, the use of multi-temporal variables improved prediction performance because it captures the dynamic relationship between SOC and vegetation over a longer period. The improvement in performance observed from the combination of the two scenarios is consistent with Zhang et al. [66], who reported an improvement in accuracy ranging between 2% and 5%. Similarly, Zhou et al. [69] highlighted that combining Sentinel-1 and Sentinel-2 data led to an increase in SOC prediction accuracy of 5% to 6% and a reduction in error of 5% to 7%. Including topographic features increased the performance of all models, with a significant contribution from elevation. The highest performance was reached by the RF model, with an R² of 0.7, an RMSE of 0.012%, and an RPIQ of 5.754. The importance and contribution of topographic features were highlighted by Zhou et al. [70], who showed that elevation, slope, and TWI contributed more than 27% to the model's explanation. Additionally, Li et al. [71] showed that relief and TWI were the most important variables controlling SOC. The same was demonstrated by Gibson et al. [72], indicating that topographic features have an impact on SOC modeling at different resolutions. Likewise, Duarte et al. [73] combined Landsat-8 data with various other covariates, such as climate and topography, and obtained their best results for SOC stocks in forested land.
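The three statistics quoted above (R², RMSE, and RPIQ) can be computed as in the following minimal sketch. It assumes the commonly used definitions, namely R² as one minus the ratio of the residual to the total sum of squares and RPIQ as the interquartile range of the observed values divided by the RMSE; the exact formulas used in this study may differ in detail. The function name and the sample values are hypothetical and are not data from this study.

```python
import numpy as np


def regression_metrics(observed, predicted):
    """Return R2, RMSE and RPIQ for paired observed/predicted values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = observed - predicted

    # Root mean squared error, in the same unit as the observations.
    rmse = float(np.sqrt(np.mean(residuals ** 2)))

    # Coefficient of determination: 1 - SS_res / SS_tot.
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((observed - observed.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot

    # Ratio of performance to interquartile range (assumed definition):
    # IQR of the observed values divided by the RMSE.
    q1, q3 = np.percentile(observed, [25, 75])
    rpiq = float((q3 - q1) / rmse)

    return {"r2": r2, "rmse": rmse, "rpiq": rpiq}


# Hypothetical SOC values in percent (illustration only).
observed = [0.11, 0.20, 0.30, 0.40, 0.72]
predicted = [0.15, 0.22, 0.28, 0.35, 0.60]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

Because RPIQ scales the error by the spread of the observations, it stays interpretable for skewed SOC distributions, where a few high values would otherwise dominate variance-based measures.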
The comparison of ML algorithms revealed that RF and XGBoost outperformed the SVR model, mainly due to their ensemble nature, which offers greater adaptability in addressing complex, non-linear relationships within data. Across all scenarios, RF and XGBoost consistently demonstrated higher R² values compared to the SVR model, indicating that a greater proportion of the variance in SOC was explained by the model, as well as lower RMSE values. These results are also reflected in other studies, such as that of Nguyen et al. [54], who highlighted that XGBoost and RF surpassed the SVR model in predicting SOC content using Sentinel-1 and Sentinel-2 data, with an R² value higher than 0.7. Similarly, Siewert [74] compared various algorithms for SOC prediction and identified a superior performance of RF models over others. Moreover, Zhang et al. [66] observed that RF could outperform XGBoost when using separate Sentinel data, which is in line with our findings of RF R² values of 0.61 and 0.7 for Scenarios 3 and 4, respectively, versus R² values of 0.51 and 0.64 for XGBoost and 0.38 and 0.56 for SVR. The performance results obtained in the present study are similar to those reported by Pouladi et al. [75], who used only Sentinel-2, and Nguyen et al. [54], with R² values around 0.72 for RF; however, these values were higher than those obtained in other studies that demonstrated low performance, such as Shafizadeh-Moghadam et al. [23] and Tajik et al. [76], with R² values less than 0.5. The low performance in these studies can generally be attributed to factors such as high heterogeneity over an extensive study area and a low density of sampling points [70]. In our case, the low performance for Scenario 1 and Scenario 3 may be attributed to the low variability in SOC content (min = 0.11%, max = 0.72%), which could introduce complexity into the modeling process [12]. The SOC distribution also revealed that the XGBoost algorithm predicts lower SOC values than the RF model. This could reflect more conservative estimation or potential underfitting, where the XGBoost model does not fully capture the higher SOC values present in the training data, perhaps due to model complexity or regularization parameters. Both models have limitations in representing the less frequent, slightly higher SOC values, which were few in the training data. This skew towards lower SOC values is a common problem in machine learning, where model performance is strongly influenced by the distribution of the training dataset. In practical applications, this means that SOC in areas with naturally higher levels may be underestimated.
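A comparison of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the study's workflow: the data are synthetic, the hyperparameters are arbitrary, 5-fold cross-validation is assumed as the evaluation scheme, and scikit-learn's GradientBoostingRegressor is swapped in as a stand-in for XGBoost so that the example depends on a single library.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic covariates and a non-linear, narrow-range target (hypothetical;
# chosen only to mimic a low-variability response such as SOC in percent).
X = rng.uniform(-1.0, 1.0, size=(150, 8))
y = (0.3 + 0.1 * np.sin(3.0 * X[:, 0]) + 0.05 * X[:, 1] * X[:, 2]
     + rng.normal(scale=0.02, size=150))

models = {
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
    # SVR is sensitive to feature scale, so it is wrapped with a scaler.
    "SVR": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.01)),
    # Stand-in for XGBoost (assumption): scikit-learn gradient boosting.
    "GBM": GradientBoostingRegressor(random_state=0),
}

# Every model is evaluated on the same folds so the scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {}
for name, model in models.items():
    predicted = cross_val_predict(model, X, y, cv=cv)
    scores[name] = {
        "r2": float(r2_score(y, predicted)),
        "rmse": float(np.sqrt(mean_squared_error(y, predicted))),
    }

for name, result in scores.items():
    print(f"{name}: R2 = {result['r2']:.2f}, RMSE = {result['rmse']:.3f}")
```

Plotting the out-of-fold predictions against the observations for each model is one way to check for the underestimation of higher values discussed above.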