*Article* **Evaluation of Soil Nutrient Status Based on LightGBM Model: An Example of Tobacco Planting Soil in Debao County, Guangxi**

**Zhipeng Liang <sup>1</sup>, Tianxiang Zou <sup>1</sup>, Jialin Gong <sup>1</sup>, Meng Zhou <sup>1</sup>, Wenjie Shen <sup>1,2,3,</sup>\*, Jietang Zhang <sup>4</sup>, Dongsheng Fan <sup>5</sup> and Yanhui Lu <sup>5</sup>**

- <sup>4</sup> Guangdong Vcarbon Testing Technology Co., Ltd., Qiangyuan 511500, China

**Abstract:** Soil nutrient status is the foundation of agricultural development. Exploring the features of soil nutrients and evaluating their status can provide a reference for the development of modern agriculture. LightGBM is a gradient boosting framework that uses a histogram-based decision-tree algorithm to speed up training and reduce memory use. A LightGBM model was constructed to analyze the main nutrient features and status of tobacco planting soil in seven towns in Debao County, Guangxi Province, namely Yantong Town, Longguang Town, Najia Town, Zurong Town, Du'an Town, Dongling Town and Jingde Town. The confusion matrix results show that the accuracy of the LightGBM model is 94.2%, and the eigenvalue analysis shows that available potassium (K) contributes the most to the nutrient status. The soil pH ranges from 6.1 to 7.8, which is favorable for tobacco growth, and the contents of soil organic matter, total nitrogen (N), available phosphorus (P), exchangeable calcium (Ca) and exchangeable magnesium (Mg) are at appropriate levels. Available potassium (K) and available zinc (Zn) are at a high level, but available boron (B) is slightly insufficient. The nutrient status of 10% of the soil is at an extremely high level, and about 81.03% of the soil is at the medium level or above. The LightGBM model is highly reliable for the automatic evaluation of soil nutrient status: it not only evaluates the soil nutrient status accurately but also reflects the correlation and importance of nutrient factors. Therefore, the LightGBM model is significant for guiding soil cultivation and agricultural production.

**Keywords:** LightGBM model; tobacco planting soil; confusion matrix; eigenvalue analysis; nutrient features; nutrient status

## **1. Introduction**

Soil, as a basic environment for crop growth and an important means of agricultural production, is the primary guarantee for the sustainable development of the biosphere [1,2]. The abundance or shortage of soil nutrients greatly affects the quality of crops, which is one of the important factors for the development of planting agriculture [3]. Influenced by landform, climate, altitude and so on, soil nutrients are diverse in different regions [4]. The level of soil nutrients is not only affected by the independent role of nutrient factors but also depends on the comprehensive coordination of various nutrient factors [5]. Therefore, exploring the comprehensive evaluation of soil nutrient status can lead to a deeper understanding of the current nutrient features of soil, which has important guiding significance for farming and fertilization in agricultural areas.

According to previous studies, the evaluation of soil nutrient features and status is mainly based on the comprehensive evaluation of nutrient factors in the study area [6,7]. However, due to the different locations, soil texture, hydrological conditions and suitable crop types of various cultivated land soils, there are no standard methods to evaluate the soil nutrient status [8]. The commonly used methods include principal component analysis, cluster analysis, the fuzzy mathematical model membership function, the Nemerow comprehensive index and gray correlation analysis [9].

**Citation:** Liang, Z.; Zou, T.; Gong, J.; Zhou, M.; Shen, W.; Zhang, J.; Fan, D.; Lu, Y. Evaluation of Soil Nutrient Status Based on LightGBM Model: An Example of Tobacco Planting Soil in Debao County, Guangxi. *Appl. Sci.* **2022**, *12*, 12354. https://doi.org/10.3390/app122312354

Academic Editors: Dibyendu Sarkar and Andrea L. Rizzo

Received: 21 September 2022 Accepted: 23 November 2022 Published: 2 December 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In recent years, machine learning has received wide attention in various fields. As an extension of applied statistics, machine learning is well suited to research and applications in agronomy and geosciences [10]. For example, ensemble algorithms such as random forest and XGBoost are often used to solve classification and regression problems in geochemical research due to their good performance [11,12]. Tian et al. [13] established a random forest model for the automatic evaluation of soil nutrient status and obtained objective and accurate results. Tong et al. [14] assessed the waterlogging risk in the central cities of the Yangtze River Delta with the XGBoost model and found its assessment and prediction results to be more accurate than those of traditional research methods. LightGBM is an open-source framework developed by Microsoft, which uses the histogram decision-tree algorithm and is regarded as an improved version of XGBoost. Compared with previous models, LightGBM has the advantages of fast training speed, high accuracy, low memory usage and more objective training results, and it is mainly used to deal with classification and regression problems in data analysis [15,16]. Therefore, applying the LightGBM model to the research of soil nutrient status can obtain more objective and accurate evaluation results.

Debao County is one of the main tobacco planting areas in Baise City, which is the largest tobacco planting area in Guangxi Province, with an area of about 8167 ha, accounting for 78% of the tobacco planting area in the province [17,18]. At present, there are few in-depth studies on the evaluation of soil nutrient status and on soil nutrient dynamics in Debao County, and the main research methods are traditional. Applying machine learning can evaluate the abundance and deficiency of nutrients in local tobacco planting soil more accurately and objectively. This supports a better local understanding of the current nutrient status of tobacco planting soil and provides a method reference for regional soil nutrient status evaluation, so as to promote the development of agriculture.

In this study, the data of nine nutrient factors in tobacco planting soil of seven towns in Debao County, namely Yantong Town, Longguang Town, Najia Town, Zurong Town, Du'an Town, Dongling Town and Jingde Town, were preprocessed through principal component analysis and then used as the test set for the LightGBM model. The feasibility of the LightGBM model was verified using the confusion matrix, and the differences in importance among the soil nutrient factors were obtained with eigenvalue analysis. Through the classification and prediction of the LightGBM model, the nutrient status of tobacco planting soil in the study area was evaluated automatically. Therefore, using the LightGBM model to study the nutrient status of tobacco planting soil can provide some scientific reference for the improvement of soil fertility in the local tobacco industry and the rational layout of tobacco planting.

## **2. Materials and Methods**

## *2.1. Experimental Sites*

Debao County (106°10′–107°00′ E, 23°00′–23°40′ N) is located in the southwest of Baise City, Guangxi Province, at an altitude of 200–1000 m. Its climate is warm and wet, and the hydrothermal and sunshine conditions are good in all four seasons, which makes it very suitable for producing high-quality tobacco. The local average annual rainfall is about 1462.5 mm, the average annual temperature is about 19.5 °C and the average annual sunshine duration is about 1325 h [19]. The soil texture types in the study area are mainly clay and loam. The vast area of cultivated land is conducive to agricultural development [20].

## *2.2. Data Source*

According to the planting situation of the tobacco planting industry in the study area, 290 soil samples were collected from seven towns in Debao County, namely Yantong Town, Longguang Town, Najia Town, Zurong Town, Du'an Town, Dongling Town and Jingde Town (Figure 1). The samples were collected according to the *Technical Specification for Soil Testing and Formulated Fertilization* [21]. Among them, every 50 acres of contiguous tobacco fields was taken as a sampling unit, and the 's' shape distribution method was adopted for sampling. The isolated and small tobacco fields that are not connected were used as sampling units, and the 'plum blossom' point distribution method was adopted for sampling. During soil collection, the topsoil of 0–20 cm in the tillage layer was selected, and each sampling unit had 5–8 sampling points. After natural air drying, impurity removal, grinding, screening and other steps, the nutrient factor indexes were determined and analyzed.

**Figure 1.** Spatial distribution of tobacco planting soil sampling location in Debao County.

## *2.3. Evaluation Index and Measurement*

Based on the commonly used indicators for soil nutrient status evaluation, nine nutrient factors were selected in this study to assess the soil of Debao County, namely pH value, organic matter, total N, available K, available P, exchangeable Ca, exchangeable Mg, available B and available Zn, for content detection and analysis. All indicators were determined in strict accordance with standard methods (Table 1). According to the *Integrated Management of Tobacco Planting Soil and Tobacco Nutrients in China* [22], the grading standard of abundance and deficiency of different nutrient factors is shown in Table 2.


**Table 1.** Determination indexes and methods of soil nutrients.

**Table 2.** Evaluation criteria of abundance and deficiency for nutrients in tobacco planting soil.


## *2.4. Research Methods*

## 2.4.1. LightGBM

LightGBM is an optimized algorithm based on the gradient boosting decision tree (GBDT) framework, launched by Microsoft in 2017. It is regarded as an upgraded version of XGBoost, with more efficient parallel training, lower memory consumption and more accurate results [30]. As a boosting method, LightGBM combines multiple weak learners (decision trees) into a strong learner, and it adopts the histogram decision-tree algorithm to build each tree. As the tree models are combined, the calculation complexity is reduced by making use of histogram differences, and the resulting high-quality trees can be used as a classification and prediction model [31].
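The histogram idea can be illustrated with a minimal sketch. It assumes a single feature, a squared-error loss and the usual second-order split gain used in gradient boosting; the helper names `build_histogram` and `best_split`, the regularization term `lam`, the bin count and the synthetic data are all illustrative choices, not LightGBM internals or data from this study. The last lines show the histogram difference: only one child histogram is built from the samples, and the sibling is obtained by subtracting it from the parent.

```python
import numpy as np

def build_histogram(bins, grad, hess, n_bins):
    """Sum the gradients and Hessians of the samples that fall into each bin."""
    g = np.bincount(bins, weights=grad, minlength=n_bins)
    h = np.bincount(bins, weights=hess, minlength=n_bins)
    return g, h

def best_split(g_hist, h_hist, lam=1.0):
    """Scan the bin boundaries of one feature and return (best_bin, best_gain)."""
    g_total, h_total = g_hist.sum(), h_hist.sum()
    g_left = np.cumsum(g_hist)[:-1]          # samples with bin <= b go left
    h_left = np.cumsum(h_hist)[:-1]
    g_right, h_right = g_total - g_left, h_total - h_left
    gain = (g_left**2 / (h_left + lam)
            + g_right**2 / (h_right + lam)
            - g_total**2 / (h_total + lam))
    b = int(np.argmax(gain))
    return b, float(gain[b])

# Synthetic stand-in for one nutrient factor (hypothetical values, mg/kg)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 400.0, size=1000)
y = (x > 150.0).astype(float)                # toy binary target

# Discretize the continuous feature into 16 bins (bin indices 0..15)
n_bins = 16
edges = np.linspace(x.min(), x.max(), n_bins + 1)[1:-1]
bins = np.digitize(x, edges)

# Squared-error loss around a constant prediction of 0.5:
# gradient = prediction - target, Hessian = 1
grad = 0.5 - y
hess = np.ones_like(y)

g_parent, h_parent = build_histogram(bins, grad, hess, n_bins)
split_bin, gain = best_split(g_parent, h_parent)

# Histogram difference: build the left child only, derive the right child
left = bins <= split_bin
g_left, h_left = build_histogram(bins[left], grad[left], hess[left], n_bins)
g_right, h_right = g_parent - g_left, h_parent - h_left
```

Scanning 16 bins instead of 1000 sorted values is what makes split finding cheaper, and the subtraction step avoids a second pass over the samples of the sibling node.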

## 2.4.2. LightGBM Model Construction

The LightGBM model training is divided into five steps, namely data collection, feature engineering, model training, cross validation and accuracy evaluation [32]. The data collection is mainly the experimental data of nine nutrient factors, namely pH value, organic matter, total N, available K, available P, exchangeable Ca, exchangeable Mg, available B and available Zn in the soil of the tobacco growing area. The data were preprocessed, and the model database was established after screening and eliminating the abnormal values.

Feature engineering is an important part of the LightGBM model (Figure 2) construction. It is mainly used to classify sample data through feature values that can reflect the nature of classification. In addition, model training and cross validation mainly optimize the model through continuous learning and training. After that, the accuracy of the sample set and the test set was evaluated, and the results were output.

A comprehensive analysis of nine nutrient factors, namely pH value, organic matter, total N, available K, available P, exchangeable Ca, exchangeable Mg, available B and available Zn, was carried out for 1038 sample points in Baise, Hechi and Hezhou City of Guangxi. The data were preprocessed with principal component analysis and cluster analysis and were used as the training data of the model. The 290 samples in the study area are the test data. Since the original data were divided into five grades according to the nutrient status of tobacco planting soil based on the class average method in cluster analysis during preprocessing, the class labels of the confusion matrix are also displayed as five grades during model training [33].

**Figure 2.** The LightGBM modeling process framework.
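The preprocessing described above (principal component analysis followed by class-average clustering into five grades) could be sketched roughly as follows. This is only an illustration under stated assumptions: the data are random stand-ins for the nine nutrient factors, the 85% explained-variance cutoff is a hypothetical choice, and SciPy's average-linkage clustering is used here as one common implementation of the class average method; the study's actual settings are not given in the text.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic stand-in: 200 samples x 9 nutrient factors (not the study data)
rng = np.random.default_rng(42)
n_samples, n_factors = 200, 9
X = rng.normal(size=(n_samples, n_factors))

# Standardize each factor so that units (g/kg, mg/kg, cmol/kg) do not dominate
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Principal component analysis through the singular value decomposition
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained = S**2 / np.sum(S**2)

# Keep the leading components that explain at least 85% of the variance
# (the 85% threshold is an assumption made for this sketch)
k = int(np.searchsorted(np.cumsum(explained), 0.85) + 1)
scores = Z @ Vt[:k].T

# Average-linkage hierarchical clustering, cut into at most five clusters
tree = linkage(scores, method="average")
labels = fcluster(tree, t=5, criterion="maxclust")
```

The cluster numbers returned by `fcluster` carry no order, so an extra step (for example, ranking clusters by their mean nutrient levels) would still be needed to map them onto grades I to V.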

## **3. Results and Analysis**

## *3.1. Analysis of Nutrient Features*

As shown in Table 3, the pH of tobacco planting soil in the seven towns of Debao County is generally between 6.1 and 7.8, with an average of about 6.9, which is weakly acidic overall. The content of soil organic matter is between 12.1 and 49.3 g/kg, with an average of 27.5 g/kg, which is moderate, while the average contents of organic matter in Najia Town, Zurong Town and Du'an Town are 38.0, 33.5 and 32.5 g/kg, respectively, which are within the high range. The average contents of soil total N, available K and available P are 1250, 270.5 and 23.0 mg/kg, respectively. Among them, the contents of soil total N and available P are moderate, and the content of available K is high.

Figure 3 shows that the exchangeable Ca content in the tobacco planting soil in seven towns of Debao County is relatively high, with more than 47% of the soil samples containing >4 cmol/kg and 22% of the soil samples containing >10 cmol/kg, which is extremely high. The exchangeable Mg content of soil is good, and the content of most samples is between 0.8 and 1.6 cmol/kg, which is basically within the suitable range for planting high-quality tobacco. The number of soil samples with available B content lower than the B deficiency threshold (0.5 mg/kg) [34] accounted for about 35%, and the number of samples with available Zn content higher than the enrichment threshold (1.0 mg/kg) accounted for about 99.5%. Therefore, the soil available Zn content is very high, but the available B content is slightly insufficient.


**Table 3.** Soil pH values, organic matter, total N, available K, available P content in seven tobacco planting towns of Debao County.

**Figure 3.** Distribution of exchangeable Ca, exchangeable Mg, available B and available Zn in tobacco growing soil.

## *3.2. Pearson Correlation Analysis*

The quality of soil nutrient status is a comprehensive reflection of various nutrient factors, and there is a certain correlation between different soil nutrient factors [8,35]. Pearson correlation analysis of the nine soil nutrient factors in the tobacco planting areas can lead to a better understanding of the relationships between them. The results (Table 4) show that pH value has a significant positive correlation with organic matter, total N, exchangeable Ca and exchangeable Mg and a significant negative correlation with available B. Among them, the correlation coefficient between pH value and exchangeable Ca reached 0.613. It can be seen that the acidity and alkalinity of soil significantly affect the concentration of calcium ions: where the pH value is high, the exchangeable Ca content is high.

In addition, soil organic matter and total N show a significant positive correlation, with a coefficient of 0.588. The main reason is that N in the soil mainly exists in the form of organic N, which mainly comes from the degradation of organic matter.


**Table 4.** Correlation analysis of tobacco planting soil.

Note: \* *p* < 0.05, \*\* *p* < 0.01.
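As an illustration of how one entry of such a correlation table is obtained, the sketch below computes a Pearson coefficient and its significance marker for two variables. The data are synthetic (the variable names `ph` and `exch_ca` and the way they are generated are assumptions made for the example, not the measured values), so the printed coefficient will not match Table 4.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-ins for two nutrient factors; 290 matches the number of
# Debao samples, but the values themselves are made up
rng = np.random.default_rng(7)
n = 290
ph = rng.normal(6.9, 0.4, size=n)
exch_ca = 6.0 + 4.0 * (ph - 6.9) + rng.normal(0.0, 2.0, size=n)

# Pearson correlation coefficient and two-sided p-value
r, p = pearsonr(ph, exch_ca)

# The same coefficient from the definition: covariance over the product
# of the standard deviations
dx, dy = ph - ph.mean(), exch_ca - exch_ca.mean()
r_manual = float((dx @ dy) / np.sqrt((dx @ dx) * (dy @ dy)))

# Significance markers as used in the table note
stars = "**" if p < 0.01 else "*" if p < 0.05 else ""
print(f"r = {r:.3f}{stars} (p = {p:.2e})")
```

Repeating this for every pair of the nine factors fills the lower triangle of the correlation table.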

## *3.3. Evaluation of Soil Nutrient Status by LightGBM Model*

## 3.3.1. Confusion Matrix

A confusion matrix, also known as an error matrix, is a standard format for expressing accuracy evaluation in ensemble algorithm models and is also a method for judging the classification performance of a model [36]. It is a matrix in which rows represent actual classes and columns represent predicted classes. In the calculation process of the LightGBM model, 1038 samples with 9 nutrient factors from Baise, Hechi and Hezhou City were taken as the data set, of which 70% were used as the training sample set and 30% as the validation sample set (a ratio of 7:3) for multi-class prediction.

As shown in Figure 4, 127 of the 134 samples in the class I nutrient state were classified correctly, a validation rate of 94.8%. In classes II, III, IV and V, 34, 37, 61 and 35 samples were classified correctly, with validation rates of 91.9%, 92.5%, 92.4% and 94.6%, respectively. Of the 312 validation samples across the five nutrient status levels, 294 were assigned to their true categories, giving an overall accuracy of 94.2%. The model has a small calculation error and high classification accuracy.

**Figure 4.** Prediction results of the LightGBM model confusion matrix.

## 3.3.2. Eigenvalue Analysis

Eigenvalue analysis is an important part of the feature engineering operation in the LightGBM model, and it directly reflects the role of each independent variable in classifying the samples. A weight is calculated for each independent variable to express its importance to the classification results, which is called the importance of the independent variable [32]. In the LightGBM model, the eigenvalues are the independent variables, that is, the nine nutrient factors in the nutrient status evaluation. The importance ranking is shown in Figure 5. The larger the value, the higher the importance.

In Figure 5, the eigenvalue score of available K is more than 1000, which is much higher than those of the other eight nutrient factors, indicating that it has the strongest importance and the greatest contribution to the evaluation and grading of the nutrient status of tobacco planting soil. The contribution from high to low is available K, available P, organic matter, total N, available B, pH value, available Zn, exchangeable Ca and exchangeable Mg, which indicates that these nutrient factors differ in their roles in evaluating the nutrient status of tobacco planting soil and that the model can readily identify and classify the samples. The top six features, available K, available P, organic matter, total N, available B and pH value, are also important nutrient factors for the comprehensive evaluation of the nutrient status of tobacco planting soil. For example, when Guo [37] studied the nutrient status of tobacco planting soil in the Erhai Lake Basin, the nutrient factors selected were available K, available P, organic matter, total N and pH value. Mu [38] showed that available B is one of the trace elements necessary for crop growth, and its abundance affects the development and quality of tobacco. Therefore, the results of the LightGBM model are consistent with actual research findings.

In addition, the contributions of available Zn, exchangeable Ca and exchangeable Mg are low, with characteristic scores lower than 200. The quality of soil nutrient status is a comprehensive reflection of various nutrient factors in the soil. Because the LightGBM model accounts for the coexistence and mutual influence of the nine soil nutrient factors, it is reasonable and can obtain more accurate and objective evaluation results of nutrient status.

**Figure 5.** Eigenvalue score of LightGBM model.

## 3.3.3. Nutrient Status Evaluation

A total of 290 tobacco planting soil samples with nine nutrient factors from Debao County were taken as the test sample set for the automatic classification of nutrient status with the LightGBM model. The classification was consistent with that of model training and was divided into five grades (Table 5). As shown in Table 5, Grade I indicates that the nutrient status of tobacco planting soil is at an extremely high level, accounting for 10%. Grades II and III indicate that the nutrient status is high and relatively high, accounting for 7.24% and 17.59%, respectively. Grade IV indicates that the nutrient status of tobacco planting soil is at the medium level, accounting for 46.2%, which is the largest share in the county. Grade V indicates that the nutrient status of tobacco planting soil is at the low level, accounting for 18.97%.

Based on the nutrient status evaluation results of the LightGBM model, the distribution of soil nutrient status in the seven tobacco planting towns in Debao County was obtained using the kriging analysis method (Figure 6). As shown in Table 5 and Figure 6, among the seven tobacco planting towns, only Yantong Town and Jingde Town have tobacco planting soil at an extremely high nutrient level, accounting for 0.69% and 9.31%, respectively. The nutrient status of Yantong Town, Du'an Town, Dongling Town and Jingde Town is at a medium level or above, while Longguang Town, Najia Town and Zurong Town are at medium and low levels. In general, the soil nutrient status of the seven tobacco planting towns in Debao County is basically at the medium level or above, accounting for 81.03%, and the low level of nutrient status accounts for 18.97%.

**Table 5.** Evaluation of nutrient status based on the LightGBM model. Values are the proportion of each nutrient grade (%).

| Town | I Extremely High | II High | III Relatively High | IV Medium | V Low |
|---|---|---|---|---|---|
| Yantong | 0.69 | 1.04 | 4.14 | 3.79 | 6.9 |
| Longguang | 0 | 0 | 0 | 1.72 | 2.41 |
| Najia | 0 | 0 | 0 | 3.45 | 2.76 |
| Zurong | 0 | 0 | 0 | 8.28 | 3.79 |
| Du'an | 0 | 1.03 | 1.72 | 3.1 | 0 |
| Dongling | 0 | 0.34 | 1.04 | 1.72 | 0 |
| Jingde | 9.31 | 4.83 | 10.69 | 24.14 | 3.11 |
| Total | 10 | 7.24 | 17.59 | 46.2 | 18.97 |

**Figure 6.** Spatial distribution of soil fertility.
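The figures quoted in this subsection can be checked directly from the Table 5 values. The short sketch below recomputes the grade totals and the share at the medium level or above; the implied sample counts assume that every percentage is taken over the 290 test samples, which is an inference rather than something the table states.

```python
import numpy as np

towns = ["Yantong", "Longguang", "Najia", "Zurong", "Du'an", "Dongling", "Jingde"]
grades = ["I", "II", "III", "IV", "V"]

# Proportion of each nutrient grade per town (%), copied from Table 5
share = np.array([
    [0.69, 1.04,  4.14,  3.79, 6.90],   # Yantong
    [0.00, 0.00,  0.00,  1.72, 2.41],   # Longguang
    [0.00, 0.00,  0.00,  3.45, 2.76],   # Najia
    [0.00, 0.00,  0.00,  8.28, 3.79],   # Zurong
    [0.00, 1.03,  1.72,  3.10, 0.00],   # Du'an
    [0.00, 0.34,  1.04,  1.72, 0.00],   # Dongling
    [9.31, 4.83, 10.69, 24.14, 3.11],   # Jingde
])

# Column sums reproduce the "Total" row of the table
grade_total = share.sum(axis=0).round(2)

# Grades I to IV together are "medium level or above"; grade V is "low"
medium_or_above = round(float(grade_total[:4].sum()), 2)
low = round(float(grade_total[4]), 2)

# Implied sample counts, assuming percentages refer to all 290 test samples
counts = np.rint(share * 290 / 100).astype(int)

print(dict(zip(grades, grade_total.tolist())))
print(medium_or_above, low, int(counts.sum()))
```

The column sums match the Total row, grades I to IV add up to 81.03% and the implied counts add up to 290 samples.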

## **4. Discussion**

Soil pH directly affects the growth of plants, and weakly acidic soil is more suitable for the growth of tobacco [39,40]. The pH of tobacco planting soil in the seven towns of Debao County is generally at an appropriate level, which is conducive to the development of tobacco agriculture. The contents of soil organic matter, total N, available P, exchangeable Ca and exchangeable Mg are moderate. The average content of soil available K (270.5 mg/kg) is almost five times the average level of the national tobacco planting soil (57.5 mg/kg). Tobacco is a typical K-loving crop, and K has an important impact on the growth, development and quality of flue-cured tobacco [41]. Compared with the national tobacco planting soil, the high soil K in the tobacco area of Debao County is more conducive to producing high-quality tobacco. The available Zn content of the tobacco planting soil in the county is at a very high level, and the available B content is slightly insufficient. The application of B-containing trace element fertilizer can be slightly increased to improve the available B content of the local tobacco planting soil.

The nutrient status of tobacco planting soil was evaluated with the LightGBM model. The accuracy of the model is 94.2%, so the classification accuracy is high and the reliability is strong. The feature value of available K is the largest, which means that its contribution to the evaluation of soil nutrient status and its importance are the greatest in the tobacco planting area. This also corresponds to the high content of available K in the tobacco planting soil in the study area. In the evaluation of nutrient status, Yantong Town, Du'an Town, Dongling Town and Jingde Town have good nutrient status, and the county is generally at or above the medium level, accounting for 81.03%, which indicates good tobacco planting value and is conducive to the development of the tobacco industry. Some soil nutrients are slightly insufficient. In the future, soil nutrients can be improved by formulating reasonable and scientific fertilization measures to provide more fertile soil conditions for the planting of tobacco crops.

By applying an ensemble algorithm to the evaluation of soil nutrient status, a multi-class nonlinear mapping between nutrient factors and nutrient status can be established. The regular and clear evaluation indexes improve the classification performance on training samples and make the final classification prediction more scientific. In this study, LightGBM, as an upgraded and improved version of XGBoost, obtained more accurate classification and recognition ability through full training. Automatic data processing can make the evaluation results of the nutrient status of tobacco planting soil in the study area more objective and accurate. In addition, due to the reliability and flexibility of the LightGBM model, it can be widely used in the evaluation of various types of soil nutrient status and fertility, providing more innovative research methods for the development of new agriculture in the 21st century. At the same time, the LightGBM model also provides a new method for research on issues outside the agricultural field.

## **5. Conclusions**

The tobacco planting soil in Debao County is weakly acidic. The contents of soil organic matter, total N, available P, exchangeable Ca and exchangeable Mg are at the appropriate level, and available K and Zn are at a high level, but available B is slightly insufficient. Therefore, the content of soil available B can be adjusted by artificial means such as fertilization.

Available K makes the largest contribution to the evaluation of the nutrient status of tobacco planting soil in Debao County. The nutrient status of tobacco planting soil is at a good level in Yantong Town, Du'an Town, Dongling Town and Jingde Town, while Longguang Town, Najia Town and Zurong Town are at a medium or low level. In general, the whole county has reached a medium level or above, which is conducive to the planting of tobacco crops.

The application of the LightGBM model to the evaluation of soil nutrient status is accurate, reliable and objective. Due to the good stability and wide adaptability of the model, LightGBM can be widely used to solve other problems in agriculture and geosciences so that researchers can obtain more accurate and realistic results.

**Author Contributions:** Z.L.: Writing, editing and figure preparation; T.Z., M.Z. and J.G.: Table preparation; W.S.: Funding acquisition, review and supervision; J.Z., D.F. and Y.L.: Review and editing. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Science and Technology Program of Guangzhou, grant numbers 202002030184 and 201804010190, the National Key Research and Development Program of China, grant number 2022YFF0801201 and the National Natural Science Foundation of China, grant numbers 42072229 and U1911202.

**Conflicts of Interest:** The authors declare no conflict of interest.

## **References**

