1. Introduction
Larch (Larix) species are deciduous conifers of the Pinaceae family, widely distributed in the permafrost and seasonal frost zones of the Northern Hemisphere [1,2,3,4]. In China, larch is one of the important economic forest species, particularly in the northeastern regions (such as Heilongjiang, Jilin, and Inner Mongolia), where it is planted on a large scale. Its timber is extensively used in construction, papermaking, and furniture manufacturing. Additionally, larch is highly adaptable, grows rapidly, and is commonly used in large-scale shelterbelt construction and barren land afforestation, especially in areas affected by desertification and soil erosion in northern and northeastern China. Despite its high environmental adaptability, cold resistance, and fast growth rate, larch is susceptible to various biotic and abiotic stresses, with shoot blight of larch being a significant stressor impacting this species.
Shoot blight of larch is a fungal disease caused by infection of larch trees by Neofusicoccum laricinum (Sawada) Y. Hattori & C. Nakash [5]. The disease has a long history of invasion in China, where it has spread widely and caused significant damage. It is one of the two major tree diseases under strict management in China and has been listed in the country’s control list of invasive alien species and harmful forest pests [6,7]. In China, it mainly affects artificial larch forests. The disease was first reported in Japan in 1939 and spread to China in the early 1970s. Since then, it has affected 12 provinces, including those in Northeast and North China, with an epidemic area covering more than 500,000 ha, posing a huge threat to ecological security in northern China [8,9,10,11]. Additionally, the disease has been reported in Russia, North Korea, South Korea, the UK, Canada, and other countries [12]. The disease is highly contagious and can kill the new shoots of larch trees, leading to crown dieback when it recurs year after year. It poses a severe threat to the establishment of larch plantations, especially young and middle-aged forests aged 6–15 years. Once it invades larch forests, it can deal a devastating blow to tree growth [8,10].
In recent years, scholars have conducted extensive research on this disease, most of which originates from China. For example, Bruda et al. used LC-MS/MS and weighted gene co-expression network analysis to investigate farrerol’s effects on Neofusicoccum laricinum; the research suggests that farrerol enhances disease resistance in larch [13]. Zhang et al. used the optimized Maximum Entropy (MaxEnt) and the Biomod2 ensemble (EM) model to predict the potential geographic distribution areas of shoot blight of larch in China. They found that about 20% of the total land area in China is a potential suitable distribution zone for the disease [14]. Zhou et al. also improved the Stacking model, which can predict infected areas of shoot blight of larch in Northeast China [9]. However, these studies lack an overall explanation of the spatiotemporal patterns of shoot blight of larch and research on its spread and diffusion mechanisms. Therefore, clarifying the spatiotemporal patterns of the disease’s occurrence and development in China and identifying the key factors that influence these patterns are crucial for understanding the disease’s occurrence patterns and for conducting disaster prediction and guiding production control measures.
A suitable external environment is a key factor influencing whether fungi can reproduce and spread widely [15]. Temperature and humidity can significantly affect fungal reproduction, and forests with high canopy density also favor the occurrence and development of tree diseases [8]. The conidia of the pathogen causing shoot blight of larch spread primarily through rainwater splash, which is significantly influenced by precipitation and wind speed [10]. On the other hand, the long-distance spread and diffusion of plant pests and diseases are believed to be closely related to human activities. Increasingly frequent trade of agricultural and forestry products has made the movement of plants and plant products between regions more active [16]. The transplantation of larch seedlings and the transportation of felled larch branches and other tree products can introduce pathogens into new areas. Once the pathogen establishes itself in the new area, it continues to spread through natural transmission and human planting of infected plants [12]. This is suspected to be the primary method of long-distance spread of the disease. However, the long-distance spread of shoot blight of larch remains a hypothesis, lacking sufficient evidence for confirmation.
Random Forest is a classic machine learning algorithm known for handling large datasets. It introduces randomness into the algorithm, which reduces the likelihood of overfitting compared to typical machine learning algorithms [17]. In addition to making predictions, it can rank the importance of various influencing factors and is widely used in fields like ecological protection and environmental studies [18,19]. Despite its strength, Random Forest results often lack sufficient interpretability, and it does not directly reveal interactions between variables. Geo Detector is a statistical method used to analyze spatial heterogeneity and spatial differentiation mechanisms, primarily to explore the relationship between the spatial distribution of a geographic phenomenon and its potential driving factors. It is particularly suited for analyzing spatiotemporal patterns, especially in fields such as geography, epidemiology, and ecology [20,21,22]. Geo Detector reveals differences between regions by detecting spatial heterogeneity between variables based on geographic information. In recent years, Geo Detector has achieved significant success, especially in analyzing the spatial distribution characteristics of diseases, the impact of environmental and socio-economic factors, multi-factor interaction analysis, and predicting disease spread and helping to formulate strategies [23,24]. Combining Geo Detector with Random Forest can provide a deeper and more accurate understanding of spatial analysis of influencing factors.
Therefore, this study uses methods like Geo Detector and Random Forest, focusing on aspects such as the occurrence of shoot blight of larch and the infection process of the pathogen, which have not been considered in previous studies. The research will take as an example all the districts and counties in China where shoot blight of larch is present. By applying new computer modeling techniques, the study aims to comprehensively analyze 18 potential environmental conditions that may influence the spread of this disease. The goal is to explain the true causes of shoot blight of larch on a broader scale, predict the possibility of future disease outbreaks, and propose effective control measures. The main research contents of this paper include the following: (1) a detailed description of the spatiotemporal pattern of the spread and diffusion of shoot blight of larch in China over the past 50 years; (2) clarification of the mechanisms of long-distance spread of shoot blight of larch; and (3) exploration of the combined effects of multiple factors on the spread and diffusion of shoot blight of larch.
2. Materials and Methods
2.1. Focal Species
The pathogen causes shoot blight symptoms by infecting and damaging the tree, and in severe cases, this can lead to stunted growth or even death of the tree [25]. Its distribution is mainly concentrated in temperate regions, particularly in northeastern China, the Russian Far East, and parts of North America, often occurring in areas with dense larch populations [12]. The pathogen spreads through spores, including conidia (asexual spores), which are typically formed under moist conditions and dispersed by wind. Higher humidity and precipitation facilitate the release and spread of the spores. The pathogen can also spread to surrounding larch trees via rainwater and wind, leading to rapid regional dispersal.
During the autumn and winter months, the pathogen usually forms large numbers of conidia at the infected lesions, which are released again in spring under warm and humid conditions, starting a new infection cycle. The adaptability of the shoot blight of larch pathogen is reflected in its high dependence on moist environments and its ability to survive on larch trees for extended periods [26]. With climate change, the spread of the pathogen may expand, particularly in warmer and more humid regions [14]. Ecologically, the shoot blight of larch pathogen acts mainly as a pathogen affecting the growth and reproduction of larch trees. Its widespread dissemination may lead to the death of larch trees in forests, thereby impacting the overall structure and function of the ecosystem.
2.2. Materials
This study focuses on all counties with larch distribution within the provinces in China where shoot blight of larch occurs. The study area is located in northern China (31°42′–53°33′ N, 91°20′–135°2′ E) and includes 488 counties across 12 provinces: Hebei, Shanxi, Inner Mongolia, Liaoning, Jilin, Heilongjiang, Shandong, Henan, Shaanxi, Gansu, Qinghai, and Ningxia Hui Autonomous Region. The region spans a vast area with significant latitudinal and longitudinal variations, encompassing diverse topographies such as plateaus, plains, hills, and mountains. The climate types are equally varied, including temperate monsoon, subtropical monsoon, and temperate continental climates.
The baseline county-level data on shoot blight of larch were provided by the National Forestry and Grassland Administration’s Biological Disaster Prevention and Control Center. These data include the annual number, names, and administrative codes of newly affected counties from 1973 to 2021. Larch distribution data were extracted and organized from the “2020 Forest Resource Management Map” by selecting the “dominant tree species” field for “larch species”. This includes species such as Larix gmelinii var. principis-rupprechtii, Larix olgensis, Larix gmelinii, and Larix kaempferi. These data cover counties with larch distribution across the 12 provinces, including their names and administrative codes [27].
Previous studies have shown that the occurrence of shoot blight of larch is significantly correlated with factors such as temperature and humidity, rainfall in June–August, wind speed in May–June, as well as stand age, topographic slope, stand density, tree species, and soil type. In this study, considering the systematic nature of the influencing factors and the availability of data, a total of 18 variables were selected as the influencing factors for the research [28,29]. Meteorological and surface data were obtained from the National Earth System Science Data Center (http://www.geodata.cn (accessed on 28 February 2025)). Monthly precipitation, temperature, and wind speed data from 1901 to 2022 at a 1 km resolution were downloaded from the database. Canopy closure data were derived from the 2020 GLASS (Global Land Surface Satellite) product provided by the same data center [30,31]. Canopy closure refers to the extent to which the tree canopy in a forest covers the ground, serving as an indicator of stand density. The calculation method is typically the ratio of the vertical projection area of the canopy to the forest land area, with no units. It is usually expressed as a decimal, where complete ground coverage is represented by a value of 1.0.
2.3. Spatial Statistical Analysis
2.3.1. Standard Deviation Ellipse
The standard deviation ellipse is a classic spatial statistical method used to reveal the spatial distribution characteristics of geographic features. It measures the direction and distribution of a dataset [32]. The basic parameters of the standard deviation ellipse, such as the centroid, rotation angle, major axis, minor axis, and area, can be calculated using the “Geographic Distribution Measurement” tool in ArcGIS 10.8 (Esri, Redlands, CA, USA). This method helps to depict the centrality, extent, orientation, and spatial morphological characteristics of the infected counties of shoot blight of larch.
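The ellipse parameters listed above can be illustrated with a minimal Python sketch. It assumes the axes are taken from an eigen-decomposition of the coordinate covariance matrix, which is a common textbook formulation and not necessarily the exact computation performed by ArcGIS; the function name and the sample coordinates are hypothetical, and the angle here is measured counter-clockwise from the x-axis (east), whereas GIS tools often report a clockwise angle from north.

```python
import numpy as np


def standard_deviational_ellipse(xy):
    """Return centre, major/minor semi-axes, orientation, and area of a 1-SD ellipse.

    Assumption: the axes are the square roots of the eigenvalues of the
    population covariance matrix; GIS packages may apply extra scaling.
    """
    xy = np.asarray(xy, dtype=float)
    centre = xy.mean(axis=0)                        # mean centre of the points
    cov = np.cov(xy, rowvar=False, bias=True)       # 2x2 covariance (divide by n)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    minor, major = np.sqrt(np.clip(eigvals, 0.0, None))
    vx, vy = eigvecs[:, 1]                          # direction of the major axis
    angle = np.degrees(np.arctan2(vy, vx)) % 180.0  # counter-clockwise from +x
    area = np.pi * major * minor
    return centre, major, minor, angle, area


# Hypothetical county centroids stretched from south-west to north-east.
points = [(0.0, 0.0), (1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
centre, major, minor, angle, area = standard_deviational_ellipse(points)
print("centre:", centre, "major:", major, "minor:", minor, "angle:", angle)
```

A shift of the centre between periods indicates movement of the infected area, while a growing major axis or area indicates wider dispersal along the dominant direction.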
2.3.2. Global Moran’s Index
Moran’s index is a spatial autocorrelation measurement method used to describe the degree of dispersion or aggregation among all spatial objects in a study area, as well as the average spatial association, spatial distribution patterns, and significance levels [33]. To assess the spatial autocorrelation of shoot blight of larch across the entire study area, the research presented in this paper utilizes the global Moran’s index. The calculation formula is as follows:

$$I = \frac{n}{S_0} \cdot \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^{2}}$$

where $n$ is the number of counties in the study area, $i$ and $j$ represent the $i$-th and $j$-th study areas, $x_i$ is the value of the research target in study area $i$, $\bar{x}$ is the mean of $x$, $w_{ij}$ is the spatial weight, and $S_0$ is the sum of the spatial weights. The value of $I$ ranges from [−1, 1]: $I$ > 0 indicates that similar observations are spatially clustered, $I$ < 0 indicates that dissimilar observations are spatially clustered, and $I$ = 0 indicates no spatial autocorrelation, representing a random spatial distribution in the study area.
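As a hedged illustration of the index, the following sketch computes the global Moran’s I as n divided by the sum of the weights, multiplied by the weighted cross-product of deviations over the sum of squared deviations. The function name, the four-county chain, and the binary adjacency weights are assumptions made for the example; the study itself used ArcGIS.

```python
import numpy as np


def global_morans_i(x, w):
    """Global Moran's I for values x and a spatial weight matrix w (n x n)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    n = x.size
    z = x - x.mean()                        # deviations from the mean
    s0 = w.sum()                            # sum of all spatial weights
    numerator = (w * np.outer(z, z)).sum()  # sum_i sum_j w_ij * z_i * z_j
    denominator = (z ** 2).sum()
    return (n / s0) * numerator / denominator


# Hypothetical chain of four counties; neighbours share a border (weight 1).
w = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
clustered = global_morans_i([1, 1, 0, 0], w)    # infected counties adjacent
alternating = global_morans_i([1, 0, 1, 0], w)  # infected counties interleaved
print(clustered, alternating)
```

In this toy case the clustered pattern gives a positive index and the alternating pattern a negative one, matching the interpretation of the sign given above.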
2.3.3. Kernel Density Estimation
Kernel density estimation is a non-parametric method used to estimate the probability density function of a random variable [34]. It is used to calculate the density in the surrounding areas of the infected counties of shoot blight of larch, identifying the core distribution characteristics of the infected regions. The calculation formula is as follows:

$$f(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$

where $f(x)$ represents the kernel density estimate, $K$ is the Gaussian kernel function, $x$ is the estimation point, $x_i$ is the $i$-th infected county of shoot blight of larch, $h$ (km) is the bandwidth distance, and $n$ is the number of infected counties within the bandwidth range. The implementation of spatial statistical methods such as the standard deviation ellipse, global Moran’s I index, and kernel density estimation, as well as the map creation, were all completed using ArcGIS 10.8.
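The estimator can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the ArcGIS implementation: it averages a standard Gaussian kernel of the scaled distance between the estimation point and each infected county and divides by the bandwidth, it uses all supplied points instead of only those within the bandwidth, and the coordinates and bandwidth are hypothetical.

```python
import numpy as np


def gaussian_kernel_density(point, centres, h):
    """Kernel density at `point` from infected-county locations `centres`.

    Implements f(x) = 1/(n*h) * sum_i K(d_i / h), where d_i is the Euclidean
    distance to county i and K is the standard Gaussian kernel.
    """
    point = np.asarray(point, dtype=float)
    centres = np.asarray(centres, dtype=float)
    d = np.linalg.norm(centres - point, axis=1)             # distances (km)
    k = np.exp(-0.5 * (d / h) ** 2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel
    return k.sum() / (len(centres) * h)


# Hypothetical infected-county coordinates in km on a projected grid.
counties = [(0.0, 0.0), (10.0, 5.0), (12.0, 4.0), (11.0, 6.0)]
near_cluster = gaussian_kernel_density((11.0, 5.0), counties, h=5.0)
far_away = gaussian_kernel_density((60.0, 60.0), counties, h=5.0)
print(near_cluster, far_away)
```

Locations surrounded by many infected counties receive a higher density than isolated locations, which is how core infected regions are identified.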
2.4. Geo Detector
To examine the correlation between the spatial distribution of shoot blight of larch and various influencing factors, the Geo Detector method was employed to analyze spatial heterogeneity. Geo Detector is a spatial statistical method capable of detecting spatial heterogeneity and uncovering its underlying driving forces. Additionally, it can explore interactions among different factors [21,35]. The calculation formula is as follows:

$$q = 1 - \frac{\sum_{h=1}^{L} N_h \sigma_h^{2}}{N \sigma^{2}}$$

where $L$ represents the number of provinces in the study area. $N_h$ and $N$ refer to the number of counties in province $h$ and the entire study area, respectively. $\sigma_h^{2}$ and $\sigma^{2}$ represent the variance in the occurrence of shoot blight of larch within province $h$ and across the entire region. The value of $q$ ranges from [0, 1], indicating the explanatory power of each influencing factor on the spatial heterogeneity of shoot blight of larch. A larger $q$ value indicates a stronger explanatory power of the factor, and its statistical significance can be assessed using the Geo Detector method. In this study, version GeoDetector_1.0-5 was employed (https://cran.r-project.org/web/packages/geodetector/index.html (accessed on 28 February 2025)) and executed within the R version 4.1.2 (R Core Team, Vienna, Austria) software environment.
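To make the q statistic concrete, the sketch below computes one minus the ratio of the summed within-stratum variance (weighted by stratum size) to the total variance (weighted by the total count). It is a Python illustration only; the study used the R geodetector package, and the function name and toy data here are hypothetical. Population variance (dividing by the count) is assumed.

```python
import numpy as np


def geodetector_q(y, strata):
    """Factor-detector q statistic: 1 - sum_h(N_h * var_h) / (N * var)."""
    y = np.asarray(y, dtype=float)
    strata = np.asarray(strata)
    total = y.size * y.var()  # N * sigma^2 (population variance)
    within = 0.0
    for label in np.unique(strata):
        group = y[strata == label]
        within += group.size * group.var()  # N_h * sigma_h^2
    return 1.0 - within / total


# Hypothetical occurrence (1 = infected) in four counties.
occurrence = [1, 1, 0, 0]
q_perfect = geodetector_q(occurrence, ["A", "A", "B", "B"])  # strata match the pattern
q_none = geodetector_q(occurrence, ["A", "B", "A", "B"])     # strata unrelated to it
print(q_perfect, q_none)
```

When the strata separate infected from uninfected counties completely, the within-stratum variance vanishes and q equals 1; when each stratum mirrors the overall mix, q equals 0.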
2.5. Random Forest
Random Forest is an ensemble learning classification algorithm composed of multiple decision trees. The algorithm primarily involves three main steps: random sampling, random feature selection, and majority voting [36]. By introducing a small random subset of variables at each node during the construction of decision trees, each node exhibits randomness in variable selection, while the decision trees themselves are independent of one another. Once parameter optimization is complete, Random Forest determines the importance of each feature by calculating its average influence across various decision trees. Within each tree, the improvement in accuracy (e.g., the reduction in Gini impurity) brought by each feature at the splitting nodes is recorded. Subsequently, the average accuracy improvement for this feature across all trees is calculated. Finally, the importance of features is standardized to facilitate comparison and analysis [37,38]. This level of computation provides insights into the significant impact of each feature on prediction results, enabling feature selection and model optimization. Notably, compared to many existing classification algorithms, Random Forest excels in avoiding overfitting, making it a robust and reliable choice for classification tasks.
The Random Forest regression model, based on Python 3.10 (Python Software Foundation, Wilmington, DE, USA), was constructed using the RandomForestRegressor from the Sklearn machine learning library. The study employed cross-validation and grid search within a defined range to determine the optimal values of the model parameters that yielded the best performance. This research addresses a binary classification problem, and given the application scenario’s high requirements for the model, its performance is assessed using accuracy, recall, precision, and F1 score. Specifically, accuracy refers to the proportion of correctly predicted samples among the total samples; precision represents the proportion of actual positive samples among those predicted as positive by the model; recall indicates the proportion of predicted positive samples among the actual positive samples; and the F1 score is the harmonic mean of precision and recall, serving as a comprehensive indicator that considers both precision and recall. When these four evaluation metrics exceed 80%, the model is considered reliable.
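The workflow described above (grid search with cross-validation, followed by evaluation with accuracy, precision, recall, and F1) might look roughly like the following sketch. Several assumptions apply: because the task is described as binary classification, the example uses scikit-learn’s RandomForestClassifier in place of the RandomForestRegressor named in the text; the data are synthetic (make_classification with 18 features standing in for the 18 influencing factors); and the parameter grid is a hypothetical placeholder, not the range searched in the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for county-level data: 18 factors, binary infected label.
X, y = make_classification(n_samples=400, n_features=18, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Hypothetical search range; cross-validation picks the best combination.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="f1")
search.fit(X_train, y_train)

# Evaluate the tuned model on held-out data with the four metrics.
y_pred = search.predict(X_test)
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}

# Impurity-based importances, normalised by scikit-learn to sum to 1.
importances = search.best_estimator_.feature_importances_
ranking = np.argsort(importances)[::-1]  # feature indices, most important first
print(search.best_params_, metrics, ranking[:5])
```

The 80% reliability threshold mentioned above would then be checked against each entry of `metrics`.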
2.6. Recursive Feature Elimination with Cross-Validation
RFECV, which stands for Recursive Feature Elimination with Cross-Validation, is a commonly used feature selection method that combines RFE (Recursive Feature Elimination) and CV (Cross-Validation) [39]. The feature selection process using RFECV consists of two stages. In the RFE stage, the initial feature set includes all available features. The model is built using the current feature set, and the importance of each feature is calculated. Features with low weights are removed, and the feature set is updated. This process is repeated through multiple iterations until all features are ranked based on their importance. In the CV stage, based on the feature importance determined in the RFE phase, different numbers of features are selected, and cross-validation is performed on the selected feature set. The number of features that results in the highest average score is chosen, completing the feature selection process.
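The two stages described above are bundled in scikit-learn’s RFECV class, and a minimal sketch is given below. The synthetic data, the Random Forest settings, the three-fold split, and the accuracy scoring are assumptions for illustration and are not taken from the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in: 18 candidate factors, of which only some are informative.
X, y = make_classification(n_samples=300, n_features=18, n_informative=5,
                           n_redundant=2, random_state=1)

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=1),
    step=1,                          # drop one lowest-importance feature per round
    cv=StratifiedKFold(n_splits=3),  # cross-validate each candidate feature count
    scoring="accuracy",
    min_features_to_select=1,
)
selector.fit(X, y)

print("features kept:", selector.n_features_)
print("selected mask:", selector.support_)
print("ranking (1 = selected):", selector.ranking_)
```

Features with a ranking of 1 form the subset with the best mean cross-validated score; higher rankings indicate features eliminated earlier.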
2.7. Statistical Analysis
In this study, the relationship between the spread of shoot blight of larch and larch seedling afforestation was assessed using a Logistic regression model together with Spearman’s rank correlation coefficient, Kendall’s Tau correlation coefficient, and the Mann–Whitney U test; the latter three are classical non-parametric statistical methods.
The Logistic regression model is primarily used to analyze the relationship between a binary dependent variable and independent variables. Coefficients, odds ratios (ORs), and p-values are used to assess the effect of independent variables on the dependent variable. The sign and size of the regression coefficients reflect the direction and strength of the effect of each independent variable on the dependent variable. The OR is obtained by exponentiating the regression coefficient and is always positive, with a range of (0, ∞); an OR above 1 indicates that the odds of the outcome increase with the variable, and an OR below 1 indicates that they decrease [40].
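The link between a Logistic regression coefficient and its odds ratio can be shown with a short sketch. The data are simulated (a hypothetical planting-area variable with a positive effect on infection), and scikit-learn’s LogisticRegression with a very weak penalty (large C) is used as an approximation to an unpenalised fit; p-values are not produced by this class and would need a dedicated statistics package.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated data: infection probability rises with planting area (arbitrary units).
area = rng.uniform(0.0, 5.0, size=500)
probability = 1.0 / (1.0 + np.exp(-(-2.0 + 0.8 * area)))
infected = rng.binomial(1, probability)

# Large C makes the default L2 penalty negligible (assumption for illustration).
model = LogisticRegression(C=1e6, max_iter=1000)
model.fit(area.reshape(-1, 1), infected)

beta = model.coef_[0, 0]     # change in log-odds per unit of planting area
odds_ratio = np.exp(beta)    # multiplicative change in the odds per unit
print("coefficient:", beta, "odds ratio:", odds_ratio)
```

Because the exponential function is strictly positive, the odds ratio can approach but never reach zero, and a positive coefficient always corresponds to an odds ratio above 1.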
Spearman’s rank correlation coefficient and Kendall’s Tau are used to analyze monotonic (not necessarily linear) relationships between variables and to examine the correlation between independent variables. The results are evaluated mainly by the correlation coefficient value and p-value. The correlation coefficient ranges from −1 to +1, with +1 indicating perfect positive correlation, −1 indicating perfect negative correlation, and 0 indicating no correlation [41].
The Mann–Whitney U test is used to compare the distribution differences between two independent samples and is suitable for non-normal distributions. The results are typically reported using the U statistic and p-value. The U value itself does not have a direct “good or bad” standard but reflects the difference in rank order between the two groups. Smaller U values generally indicate greater differences between the groups [42].
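The three non-parametric methods are available in SciPy, and the sketch below shows how they might be called. The numbers are invented solely to demonstrate behaviour: a monotonic but non-linear relationship for the rank correlations, and two fully separated groups for the Mann–Whitney U test.

```python
import numpy as np
from scipy.stats import kendalltau, mannwhitneyu, spearmanr

# Invented data: the response increases monotonically but not linearly.
planting_area = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
affected_share = planting_area ** 3

rho, p_rho = spearmanr(planting_area, affected_share)
tau, p_tau = kendalltau(planting_area, affected_share)

# Invented planting areas for uninfected and infected counties (no overlap).
uninfected = [1.0, 2.0, 3.0]
infected = [4.0, 5.0, 6.0]
u_stat, p_u = mannwhitneyu(uninfected, infected, alternative="two-sided")

print("Spearman:", rho, p_rho)
print("Kendall:", tau, p_tau)
print("Mann-Whitney U:", u_stat, p_u)
```

Both rank correlations reach their maximum here because the ordering of the two variables is identical, even though the relationship is cubic and not linear.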
The p-values from all four methods are used to verify statistical significance. A p-value below 0.05 indicates that the result is statistically significant. The construction of the Random Forest and RFECV models, as well as the correlation analysis, was completed using PyCharm 2022.2.1 (JetBrains, Prague, Czech Republic).
4. Discussion
4.1. The Transmission and Diffusion Mechanism of Shoot Blight of Larch
The human activities of transplanting infected seedlings are the dominant factor in shaping the spatial distribution of shoot blight of larch, with seedling planting area showing a significant positive correlation with the occurrence of the disease. From 1989 to 1996 and 1996 to 2007, China experienced the outbreak and stable periods of shoot blight of larch, respectively. The development of the disease during these two periods largely shaped the current spatiotemporal pattern of shoot blight of larch, which is now distributed across 12 provinces in China. After 1978, China launched the “Three-North Shelterbelt Project”, during which large-scale transportation of seedlings for afforestation likely carried larch seedlings infected with the blight pathogen from the three northeastern provinces to North China and Northwest China, serving as the primary cause of the disease’s introduction to these regions. Guo [25] noted that between 1981 and 1983 alone, afforestation in the Loess Hill area covered 2800 mu, involving the planting of over 600,000 trees, most of which were North China larch.
This study analyzed the relationship between the area of larch seedlings planted and the occurrence of larch shoot blight (Figure 7). A linear relationship was observed between the area of larch seedlings planted and the proportion of counties affected by larch shoot blight. As the area of larch seedlings planted within counties increased, the proportion of counties affected by larch shoot blight also increased. When the area of larch seedling planting exceeded 50,000 mu, the proportion of affected counties reached 64.29%. This indicates that the area of larch seedling planting can influence the spread of larch shoot blight. Furthermore, the study evaluated the relationship between the spread of larch shoot blight and the area of larch seedling planting using Logistic regression models and correlation coefficient statistical analysis methods.
The natural environment is a fundamental factor influencing the spatial pattern of shoot blight of larch. The occurrence and development of shoot blight of larch are directly influenced by ecological factors, with temperature, precipitation, canopy closure, and wind speed being the main contributing factors. Suitable temperatures are critical for the germination of ascospores and conidia of the blight [44]. This study identified average temperature in August as the most important factor influencing the outbreak. Symptoms of shoot blight of larch are most pronounced from mid-August to early September, and from late August to early September, perithecia gradually form on diseased branches, laying the groundwork for outbreaks in the following year. Analysis revealed that 92.23% of affected counties have an average August temperature between 9.50 °C and 22.28 °C, which aligns closely with previous studies suggesting the optimal mean temperature range for the warmest quarter is 10.1 °C to 24.0 °C [14].
Precipitation is also a critical factor influencing the outbreak of shoot blight of larch. Regions with high annual precipitation and relative humidity experience more severe infections in new larch shoots [26]. Furthermore, this study found that average precipitation in June significantly affects the spread of the disease. This period coincides with the spore dispersal stage and the active development of the disease. Research by Pan Xueren et al. indicated that spore dispersal peaks significantly following continuous rainfall. This finding underscores that adequate precipitation and humidity favor spore germination, intensifying the spread and severity of the disease.
The primary means of shoot blight of larch infection and development involve spore dispersal by wind and penetration through wounds [45]. Yu and Zhao [29] suggested that a daily maximum wind speed exceeding 4 m/s in May and June is a necessary condition for the outbreak of shoot blight of larch. Compared to Yu’s study, this research found that fewer than 10% of the affected counties met this condition, and all these counties were within 500 km of the coastline. This is consistent with the geographic limitation of Yu’s study, which focused on the three northeastern provinces, all located within 500 km of the coast.
Regarding the relationship between shoot blight of larch and canopy closure, previous studies have shown that dense forest stands with high canopy closure, which limit ventilation and light penetration, favor disease occurrence and development [46]. Disease severity tends to increase with canopy closure, and stands with low density exhibit lighter disease incidence compared to dense stands [47]. This study found a positive correlation between the number of affected counties and canopy closure, with over 80% of affected counties having a canopy closure greater than 40%.
4.2. Comparison and Selection of Models
This study represents the first application of two distinct modeling approaches, Geo Detector and Random Forest, to identify the key factors influencing the spatiotemporal dynamics of shoot blight of larch. By employing these two complementary modeling techniques, we found that several factors yielded p-values below 0.05 in the Geo Detector analysis, indicating statistically significant results that are both interpretable and reliable. In the results of the Random Forest algorithm, the model’s accuracy, recall, precision, and F1 score reached 90.76%, 85.11%, 90.91%, and 87.91%, respectively. These favorable evaluation metrics indicate that the model performs well, suggesting that the results are highly reliable.
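The four reported metrics are mutually consistent, which can be checked with a short calculation. The confusion-matrix counts below are not reported in the text; they are one hypothetical set of counts chosen only because they reproduce the stated percentages, and the actual test-set composition may differ.

```python
# Hypothetical counts (assumption): true positives, false negatives,
# false positives, true negatives. Not taken from the study.
tp, fn, fp, tn = 40, 7, 4, 68

accuracy = (tp + tn) / (tp + fn + fp + tn)          # correct / all samples
precision = tp / (tp + fp)                          # actual positives among predicted positives
recall = tp / (tp + fn)                             # predicted positives among actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("f1", f1)]:
    print(f"{name}: {value * 100:.2f}%")
```

In particular, the F1 score follows directly from the reported precision and recall as their harmonic mean, so it carries no information beyond those two values.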
Because each modeling algorithm emphasizes different aspects, each has inherent strengths and limitations. Random Forest on its own is insufficient for further elucidating the contributions of individual influencing factors. To address this, numerous studies have opted to enhance the algorithm, for example by improving classification and regression trees [37] and by reducing the correlation between regression trees [38]. Ultimately, the integration of multi-dimensional evaluation methods and model optimization contributes to enhancing the reliability and generalizability of Random Forest. Furthermore, the application of multiple algorithms in tandem facilitates a more comprehensive, multi-angle analysis of the influencing factors. Methods such as Principal Component Analysis (PCA) [48], Exploratory Factor Analysis [49], and Geo Detector can be effectively integrated with Random Forest to deepen the analysis [21]. Hence, the selection of appropriate, mutually reinforcing research methodologies based on specific research objectives is essential.
For a given system and problem, the insights and predictions derived from a model often rely more on the modeling team involved than on the scenario being analyzed [50]. Examples include global and European land cover projection models [51] and the predictions of Antarctic krill growth in the Southern Ocean made by eight different models, which are contradictory in many regions [52,53]. Therefore, selecting an appropriate model based on the specific characteristics of the research objectives is crucial.
Species distribution models (SDMs) are widely used in ecology, integrating environmental variables (such as temperature, precipitation, and elevation) with known species distribution data to predict the potential distribution of species in other unstudied areas. These models can also be combined with models such as Random Forest for improvement and possess geographical transferability for species with similar distribution environments [54,55]. Unlike the complex statistical modeling and hypothesis testing of species distribution models, Geo Detector primarily focuses on spatial distribution differences between variables to assess the strength of factors influencing species distribution, without requiring extensive model parameter adjustments. Geo Detector is particularly effective in identifying interactions between multiple factors, especially in complex environmental contexts, revealing which factor combinations have a significant impact on spatial distribution. Through repeated testing and analysis of interactions at different levels, it can provide more detailed information. While species distribution models can also incorporate interaction terms, they often rely more on prior hypotheses, and the selection and interpretation of interaction terms require strong domain knowledge, making them more challenging to intuitively grasp under the influence of multiple factors.
The results indicate that both the Random Forest and Geo Detector models provide similar findings regarding the primary factors influencing the spatiotemporal distribution of shoot blight of larch. The key determinants identified include the planting area of seedlings, canopy density, maximum wind speed in June, average temperature in August, annual maximum temperature, average precipitation in June, and annual average precipitation. The Random Forest algorithm, focusing primarily on the occurrence of the disease, highlights the importance of temperature, a factor directly influencing fungal proliferation. In contrast, Geo Detector places greater emphasis on the spatial and temporal impact of these factors, as well as their interactions, suggesting that human activities—particularly the planting area of seedlings, which facilitates the long-distance spread of the pathogen—constitute the most significant factor. This further underscores the value of combining these two models to achieve a higher degree of reliability and accuracy in identifying the key factors driving shoot blight of larch in China. Such insights are critical for predicting the future progression of the disease and for formulating effective, timely mitigation strategies.
4.3. The Limitations and Prospects of the Study
First, the influencing factors considered in this study remain incomplete. In real-world environments, the spread and prevalence of shoot blight of larch are influenced by more complex factors, such as tree species, forest age, slope aspect, elevation, and interspecies interactions. Since the mechanisms by which these factors affect the disease are unclear or difficult to quantify, the results of this study can serve as a reference for understanding the relationship between shoot blight of larch and environmental variables but cannot fully encapsulate their interactions. Second, this study used seedling planting area as a proxy for human activities to explore its relationship with the spread of shoot blight of larch. Due to the lack of precise annual data, this factor can only roughly describe the influence of human activities on the disease’s spread and diffusion during certain periods. Despite these limitations, this study represents a meaningful preliminary exploration that contributes to understanding the spatiotemporal patterns of shoot blight of larch’s spread and diffusion in China. The several key factors identified in this study can be used for further predictions regarding the occurrence and development of shoot blight of larch. Based on the aforementioned findings, the following measures can be implemented for effective management of the pathogen: (1) improve nursery sanitation by removing plant debris, regularly cleaning containers with hot water, and promptly removing infected plants that produce spores; (2) develop resistant varieties by selecting and breeding larch varieties with strong resistance, thereby reducing the risk of disease outbreaks; and (3) conduct regular disease monitoring and early warning to ensure early detection and timely management, thus achieving integrated control and ensuring the healthy growth of larch seedlings.