*2.2. Methods*

The machine learning algorithms adopted in this article are the decision tree [14,16,21] and the random forest [22–24]. The evaluation methods are Prediction Consistency (Pc), Prediction Score (Ps) and Correlation Coefficient (Cc) [25–27], which are used by the China Meteorological Administration (CMA). The modeling period is 1961–2010, and the independent test and evaluation period is 2011–2018. The decision tree and random forest models are multi-factor prediction models using 130 circulation indexes, whereas a prediction model established using any single one of the 130 circulation indexes is a single-factor prediction model.

#### 2.2.1. The Decision Tree

Yang et al. [28] introduced the basic concepts and common algorithms of decision trees, which can be used to build classification and prediction models. Suppose $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(n)})^T$ is the vector of input variables (circulation indexes), $n$ is the number of features (130 for the summer model and 34 for the winter model), $y_i \in \{1, 2, \ldots, K\}$ is the categorical response variable (that is, the precipitation amount category), $i = 1, 2, \ldots, N$, and $N$ is the sample size (58 years, 1961–2018). Of these, 1961–2010 forms the training data set used for model training, and 2011–2018 forms the independent test data set used for independent testing and evaluation. The goal of decision tree learning is to build a decision tree model from a given training set so that it classifies instances correctly. In this paper, the C4.5 algorithm of Quinlan is adopted for decision tree generation [29].
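As an illustration of the splitting criterion behind C4.5, the following minimal Python sketch computes the gain ratio of one candidate feature. It is not the authors' implementation: the feature is assumed to be the sign of a circulation index anomaly (C4.5 itself also handles continuous attributes through thresholds), and the toy data and the function names `entropy` and `gain_ratio` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gain_ratio(feature_values, labels):
    """Gain ratio of one categorical feature: information gain / split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Entropy of the labels that remains after splitting on the feature
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: anomaly sign of one index vs. precipitation category
index_sign = ["+", "+", "+", "-", "-", "-", "+", "-"]
precip = ["less", "less", "less", "more", "more", "more", "less", "more"]
print(gain_ratio(index_sign, precip))
```

In this toy case the index sign separates the two categories perfectly, so the gain ratio reaches its maximum; a feature unrelated to the category would score close to zero. A tree builder would evaluate every candidate feature this way and split on the one with the highest score.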

#### 2.2.2. Random Forest

Random forest is a versatile machine learning algorithm. It was first proposed in 2001 by Breiman, a professor of statistics at the University of California, Berkeley, and can perform both regression and classification. The basic component of the random forest is the classification and regression tree (CART) developed by Breiman and colleagues. Compared with machine learning algorithms such as neural networks, this algorithm, which repeatedly partitions the data with binary splits, effectively reduces the amount of computation. A random forest is the combination and aggregation of these trees. It improves estimation accuracy without a significant increase in computation, is insensitive to missing values and multicollinearity, and can handle up to thousands of explanatory variables, which is why it is regarded as one of the best algorithms at present [30,31].

Random forests use the Bagging method to combine decision trees: Bootstrap sampling is used to draw N samples with replacement from the original sample to build each decision tree. Under normal circumstances, a random forest generates hundreds to thousands of decision trees. Each tree in the forest is independent, and the class predicted by the largest number of trees is taken as the final result. Although the structure of a random forest is complex, it is robust and easy to use, because there is no need to consider constraints such as variable distribution conditions, interactions, non-linear effects or even missing values [32–34].

The specific construction process of the random forest is as follows:


When building a random forest, two parameters need to be set by the user according to the specific situation. In most cases, the default parameters of the model yield good simulation results without adjustment. The term "random" in random forest refers to the two sources of randomness associated with these parameters. Introducing these two sources of randomness is crucial to the classification performance of a random forest: because of them, a random forest does not easily overfit and has good resistance to noise (for example, it is insensitive to missing values). Therefore, the random forest models established in this paper to estimate precipitation all use default parameters.
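The Bagging, Bootstrap sampling and voting steps described above can be sketched in a few lines of Python. This is a simplified illustration, not the model used in this paper: each tree is reduced to a one-split stump, the two user-set parameters are assumed to be the number of trees (`n_trees`) and the number of randomly chosen candidate features (`n_candidate_features`), which the text does not name, and the data and all function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(X, y, rng):
    """Draw len(X) samples with replacement (the Bootstrap step)."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]


def fit_stump(X, y, feature_ids):
    """Fit a one-split tree using only the given candidate features."""
    best = None
    for j in feature_ids:
        for threshold in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= threshold]
            right = [lab for row, lab in zip(X, y) if row[j] > threshold]
            if not left or not right:
                continue
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(lab != left_label for lab in left)
            errors += sum(lab != right_label for lab in right)
            if best is None or errors < best[0]:
                best = (errors, j, threshold, left_label, right_label)
    if best is None:  # no valid split: always predict the majority class
        majority = Counter(y).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]


def predict_stump(stump, row):
    j, threshold, left_label, right_label = stump
    return left_label if row[j] <= threshold else right_label


def fit_forest(X, y, n_trees=25, n_candidate_features=1, seed=0):
    """Bagging: each tree sees its own bootstrap sample and feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        Xb, yb = bootstrap_sample(X, y, rng)
        feature_ids = rng.sample(range(len(X[0])), n_candidate_features)
        forest.append(fit_stump(Xb, yb, feature_ids))
    return forest


def predict_forest(forest, row):
    """Majority vote over the predictions of all trees."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical standardized index anomalies (two features) and categories
X = [[-1.0, -0.8], [-0.5, -1.2], [-0.9, -0.3], [0.7, 0.9], [1.1, 0.4], [0.6, 1.3]]
y = ["more", "more", "more", "less", "less", "less"]
forest = fit_forest(X, y)
print(predict_forest(forest, [-2.0, -2.0]), predict_forest(forest, [2.0, 2.0]))
```

The two sources of randomness appear in `bootstrap_sample` (random rows) and in `rng.sample` (random candidate features). A production library would grow full trees and reselect candidate features at every split rather than once per tree.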

#### 2.2.3. Test Method

In order to test the climate prediction quality, CMA used a prediction grading score in 2010, then Ps and Cc in 2013. In order to be consistent with the current climate operations, Pc, Ps and Cc are used in this paper to test the prediction quality of the summer precipitation in Chongqing.

(1) Pc is evaluated station by station according to whether the signs of the predicted and observed anomalies agree. The consistency rate is defined as follows:

$$\text{Pc} = \frac{N_0}{N} \times 100\%$$

where $N_0$ is the number of stations with correct climate trend prediction and $N$ is the number of stations actually participating in the assessment.
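The Pc formula can be illustrated with a short Python sketch. The function name `prediction_consistency` and the station values are hypothetical, and an anomaly of exactly zero is assumed to count as positive.

```python
def prediction_consistency(forecast, observed):
    """Pc: percentage of stations whose forecast anomaly sign matches the observation."""
    if len(forecast) != len(observed) or not forecast:
        raise ValueError("forecast and observed must be non-empty and equally long")
    # N0: stations where both anomalies fall on the same side (zero counts as positive)
    n0 = sum((f >= 0) == (o >= 0) for f, o in zip(forecast, observed))
    return n0 / len(forecast) * 100.0


# Hypothetical precipitation anomaly percentages at four stations
forecast = [25.0, -10.0, 5.0, -30.0]
observed = [12.0, 8.0, -3.0, -22.0]
print(prediction_consistency(forecast, observed))  # two of four signs agree: 50.0
```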

(2) The *Ps* test is a method that applies different weights to comprehensively test the results of climate trend prediction and anomaly level prediction. Its score is relatively intuitive. On top of the score for a correct trend prediction, an additional score can be obtained for a correct anomaly prediction, which amounts to encouraging anomaly forecasts, so the score reflects fairly well the ability and level of climate prediction.

Trend prediction is the prediction of the sign of the anomaly or anomaly percentage. When the predicted sign is identical to the observed sign (0 is treated as positive), the trend prediction is correct. Anomaly level prediction refers to predicting that the precipitation anomaly percentage reaches or exceeds ±20% and that the temperature anomaly reaches or exceeds ±1 °C.

Calculation formula of Ps test method:

$$\text{Ps} = \frac{a \times N_0 + b \times N_1 + c \times N_2}{(N - N_0) + a \times N_0 + b \times N_1 + c \times N_2 + M} \times 100$$

where $N_0$ is the number of stations with correct climate trend prediction; $N_1$ is the number of stations with correct first-order anomaly prediction; $N_2$ is the number of stations with correct second-order anomaly prediction; $N$ is the number of stations actually participating in the evaluation; $M$ is the number of stations where there are no second-order anomalies and the precipitation anomaly percentage is ≥100% or equal to −100% and the temperature anomaly is ≥3 °C or ≤−3 °C; $a$, $b$ and $c$ are the weight coefficients of the climate trend term, the first-order anomaly term and the second-order anomaly term, respectively. In this method, $a = 1$, $b = 2$ and $c = 4$.
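As a worked example of the Ps formula, the sketch below evaluates it for given station counts. The function name `prediction_score` and the counts are hypothetical. How the counts are derived from station data follows the CMA rules and is not reproduced here.

```python
def prediction_score(n, n0, n1, n2, m, a=1, b=2, c=4):
    """Ps from station counts, using the weights a = 1, b = 2, c = 4 by default."""
    numerator = a * n0 + b * n1 + c * n2
    denominator = (n - n0) + numerator + m
    if denominator <= 0:
        raise ValueError("no stations to evaluate")
    return numerator / denominator * 100


# Hypothetical counts for a 34-station evaluation:
# 20 correct trends, 5 correct first-order and 1 correct second-order anomalies
ps = prediction_score(n=34, n0=20, n1=5, n2=1, m=0)
print(round(ps, 1))  # 34 / 48 * 100, about 70.8
```

With these counts the numerator is 20 + 10 + 4 = 34 and the denominator is 14 + 34 + 0 = 48. A forecast with every trend correct and no anomalies to score gives 100.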

(3) *Cc* tests the correlation of climate trend prediction products, characterizing the degree of correlation between the forecast field and the observed field. The size of the correlation coefficient indicates how well the high and low centers of the forecast field correspond to those of the observed field. To a certain extent it reflects the accuracy of the prediction result and the quality of the prediction method, and it is one of the internationally popular prediction evaluation methods. For precipitation and temperature, the correlation coefficients are calculated mainly from the precipitation anomaly percentage and the average temperature anomaly.

Specific calculation method:

$$\text{Cc} = \frac{\sum_{i=1}^{N} \left(\Delta R_{fi} - \overline{\Delta R_f}\right) \left(\Delta R_{0i} - \overline{\Delta R_0}\right)}{\sqrt{\sum_{i=1}^{N} \left(\Delta R_{fi} - \overline{\Delta R_f}\right)^2 \sum_{i=1}^{N} \left(\Delta R_{0i} - \overline{\Delta R_0}\right)^2}}$$

where $\Delta R_{fi}$ is the forecast precipitation anomaly percentage at station $i$; $\overline{\Delta R_f}$ is the average forecast precipitation anomaly percentage over all stations in the region; $\Delta R_{0i}$ is the observed precipitation anomaly percentage at station $i$; $\overline{\Delta R_0}$ is the average observed precipitation anomaly percentage over all stations in the region; and $N$ is the total number of stations actually participating in the assessment.
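The Cc formula is the Pearson correlation between the forecast and observed anomaly fields across stations. A minimal Python sketch follows; the function name `anomaly_correlation` and the station values are hypothetical.

```python
import math


def anomaly_correlation(forecast, observed):
    """Cc: Pearson correlation between forecast and observed anomaly percentages."""
    n = len(forecast)
    if n != len(observed) or n < 2:
        raise ValueError("need at least two stations with paired values")
    mean_f = sum(forecast) / n
    mean_o = sum(observed) / n
    cov = sum((f - mean_f) * (o - mean_o) for f, o in zip(forecast, observed))
    var_f = sum((f - mean_f) ** 2 for f in forecast)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_f == 0 or var_o == 0:
        raise ValueError("correlation is undefined for a constant field")
    return cov / math.sqrt(var_f * var_o)


# Hypothetical anomaly percentages at five stations
forecast = [15.0, -20.0, 5.0, -10.0, 30.0]
observed = [10.0, -25.0, -5.0, 0.0, 22.0]
print(anomaly_correlation(forecast, observed))
```

Because both fields are centered on their own regional means, Cc measures agreement in spatial pattern rather than in the overall sign, so a forecast can score well on Pc and poorly on Cc, or the reverse.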

The released forecast referred to in this article is the forecast submitted by the Chongqing Climate Center to the National Climate Center for the assessment of forecast quality.

#### **3. Results and Analysis of Precipitation Prediction in Summer Test**

In the actual climate prediction of Chongqing, the complex and changeable terrain makes it necessary to judge the trend of both the precipitation at each of the 34 stations and their average precipitation, in order to obtain forecast data with a detailed spatial distribution. The authors applied decision trees and random forests to both the 34-station average and the individual station data. The single-station decision tree results are relatively complex and show no significant characteristics, whereas the single-station random forest results are relatively good. However, owing to the limited sample size, the random forest results based on the 34-station average are not as good as those of the decision tree method. Therefore, in this paper the decision tree model takes the average precipitation of the 34 stations as the modeling object and focuses on the collaborative influence of multiple factors, while the random forest model is built for each of the 34 stations and focuses on spatial distribution characteristics. The two methods complement each other.

#### *3.1. Decision Tree Model Test*

Considering the physical factors of the summer period, the model was built in IBM SPSS Modeler 18.0 using the CART algorithm (likewise below) (Figure 2). The model shows that the circulation indexes with a large impact on summer precipitation in Chongqing include the western Pacific subtropical high ridge line, landfall typhoon, SST in tidal zone, the northern boundary of the North African Atlantic North American High, polar vorticity in the Atlantic European region, Indian sub-high area, and 30 hPa zonal wind.

The combinations of the summer rainfall trend in Chongqing and the concurrent circulation indexes given by the CART-based model are shown in Table 1. Cases 1–4 correspond to less precipitation and cases 5–7 to more precipitation. "+" and "−" represent positive and negative anomalies of the index in the condition, respectively, and the percentage in brackets is the probability of less (or more) precipitation.

The concurrent indexes from 2011 to 2018 were used to predict the summer precipitation in Chongqing, and the predictions were compared with the observations; the results are shown in Table 2.

If the prediction considers only single-factor effects, a northward (southward) ridge line of the western Pacific subtropical high generally corresponds to less (more) summer precipitation in Chongqing. On this basis, the northward ridge line in 2011, 2012, 2015 and 2018 corresponds to less precipitation, and the result for 2015 is inconsistent with the observation. In 2013, 2014, 2016 and 2017, the southward ridge line corresponds to more precipitation, but this held only in 2014 and 2017. The total prediction accuracy was 62.5% (5/8).

When multi-factor synergy is considered, there may be less precipitation even if the western Pacific subtropical high is southerly, as shown in case (3). In the actual prediction, 2011 and 2012 are completely in line with case (1). The percentages of precipitation anomalies are −30.5% and −22.1%, which are significantly below normal. The percentage of precipitation anomaly in 2013 was −26.1%, and the result was consistent with case (3). If only the first two conditions of case (3) are met, the probability of less precipitation is only 50%. In 2013, the 30 hPa zonal wind was significantly stronger, which increased the probability of less precipitation to 100%. Similarly, in the multi-factor prediction there may be more precipitation whether the ridge line of the western Pacific subtropical high is northerly or southerly, as shown in cases (5) and (6). The circulation indexes in 2014 are consistent with case (6): the probability of more precipitation is 100%, and the actual precipitation anomaly percentage is 6.3%, which is slightly above normal. The circulation indexes in 2015 are consistent with case (5): the probability of more precipitation is 100%, and the actual precipitation anomaly percentage is 11.7%. The circulation indexes in 2016 are consistent with case (3), which predicts less precipitation, but the observed precipitation was 9.5% above normal. 2016 was a typical El Niño year, and the anomaly of the atmospheric system caused by the SST anomaly in the Pacific Ocean may be the reason for the failure of the prediction model in 2016 [30,31].

**Figure 2.** An analytic diagram of the relationship between the precipitation trend and circulation index in summer in Chongqing based on the CART algorithm. '%' represents the probability of more or less precipitation; 'n' represents the number of years with more or less precipitation (likewise below).



**Table 2.** Different circulation index anomaly, precipitation prediction and observation from 2011 to 2018.


2018 3.79 1.14 0.03 2.14 −0.17 −4.5 less −24.8

According to the prediction test for 2011–2018, the prediction accuracy with multi-factor synergy reached 87.5%, which is 25 percentage points higher than that of a single factor. Because analysis of concurrent factors is more suited to diagnostic analysis, and considering the practical requirements of prediction, the pre-winter SST indexes were selected for modeling (Figure 3) and used for operational forecasting following the same method.

**Figure 3.** Relationship between precipitation trend in summer and SST index in pre-winter based on the CART algorithm in Chongqing.

The combination of summer precipitation trend and winter SST index model based on the CART algorithm in Chongqing is shown in Table 3. In the model, there are 6 cases of lower precipitation and 6 cases of higher precipitation.

The model was tested against the observed summer precipitation in Chongqing from 2011 to 2018; the results are shown in Table 4. If only the Atlantic meridional mode SST, which has the highest correlation, is considered, precipitation in Chongqing is low when the index is high and high when the index is low. On this basis the trend forecast is correct in all years except 2014. If different combinations are considered, from 2011 to 2014 the Atlantic meridional mode SST is on the high side and the NINOA is low. In 2013 the cold tongue ENSO index was small and less precipitation was predicted, which is consistent with case (1). The remaining three years differ in the western hemisphere warm pool index: 2011 and 2012 are consistent with case (2), with less precipitation predicted, and 2014 is consistent with case (7), with more precipitation predicted. The SST signals in 2015 and 2016 are consistent with case (10), and more precipitation is predicted. 2017 and 2018 coincide with case (3), with less precipitation predicted. The test shows that the precipitation trend forecast is correct in all 8 years from 2011 to 2018 when the coordination of multiple factors is considered, which is 12.5 percentage points higher than when only a single factor is considered.

The above considers the multi-factor synergy of the decision tree method. Although quantitative prediction of Chongqing's summer precipitation cannot be achieved, the experiments show that, whether the predictive diagnostic analysis uses preceding or concurrent factors, the multi-factor approach performs clearly better than a single index. This also shows that the "climate system", as a complex system, is the result of the interaction of multiple factors and multiple systems. In the process of diagnosis or prediction, we need not only to analyze the characteristics and cycles of each part of the system separately, but also to study the integrated behavior of the entire system and the interaction of its sub-systems. This process requires statistical analysis of a large amount of ocean and atmosphere data, as well as various model prediction data, in order to identify the key factors affecting the local climate, the key regions of different circulation fields, and the key periods when indexes and circulation affect the local climate. With many "blind spots" remaining in the physical processes and research of climate system change, current prediction methods cannot make full use of these huge data resources. A factor may be important for the large-scale climate system but not necessarily critical for local climatic characteristics. This inevitably leads to factors being given too little or too much weight in forecast analysis, which increases the uncertainty of the forecast. Therefore, with the help of decision trees and other machine learning technologies, comprehensive and valuable information can be mined from the vast variety of data to discover the main systems and collaborative influence mechanisms that affect the local climate, which plays a significant role in improving the accuracy of local climate prediction.

#### *3.2. Prediction Experiment of Random Forest Model in Summer*

In operational forecasting, it is necessary not only to forecast the overall trend of the region, but also to analyze the spatial distribution pattern and forecast the rainfall centers and their locations. Therefore, building on the city-wide average model of the previous section, this section uses random forests to make predictions for the 34 National Meteorological Observatories in Chongqing. Regarding the selection of circulation indexes, because the actual summer forecast is released in March, the circulation factors available at that time extend only to February. Thus, when random forests are used for prediction, this article uses only the pre-winter SST indexes for modeling, without considering constraints such as the distribution conditions, interactions, nonlinear effects or even missing values of the variables. Figure 4 shows the random forest precipitation forecasts and the observed precipitation anomaly percentages for 2011–2018.

Figure 4 shows that summer precipitation in Chongqing was neither uniformly below nor uniformly above normal in any year during 2011–2018; the spatial distribution varied from place to place, which makes prediction difficult. Comparing the forecast with the observations, the overall trend forecast for the 8 years is fairly accurate. Only the spatial distributions of 2011 and 2015 differ slightly, and the regional forecasts for the remaining years are relatively accurate. Because the forecast is a dichotomous trend forecast and cannot be refined into an anomaly level forecast, the prediction results are tested at 20% and −20%, respectively, using the Ps, Cc and Pc test methods. The test results are shown in Table 5.

As can be seen from Table 5, the random forest prediction scores are higher and more stable. The average Ps, Cc and Pc scores for 2014–2018 were 84.6, 0.27 and 67.1, respectively, a significant improvement over the 72.4, −0.12 and 52.9 of the released forecast. In the year-by-year comparison, the Ps and Pc scores behave consistently: in 2016 and 2017 they are roughly equivalent to those of the released forecast, and in the remaining years they are about 20 points higher. The Cc score, which measures the correlation between the predicted field and the observed field, is significantly better than that of the released forecast and, except for 2015, passes the significance test at the 95% level in every year. In contrast, the Cc scores of the released forecast are mostly negative, which indicates that the predictive typing needs to be improved.

**Figure 4.** Forecast and observation distribution of summer precipitation in Chongqing based on random forest. (F) and (O) mean forecasting and observation, respectively.


*Atmosphere* **2020**, *11*,508





#### **4. Conclusions and Discussion**

By establishing a decision tree model based on multi-factor collaboration for summer precipitation in Chongqing and conducting random forest integration and testing, the following conclusions are reached:


In this paper, decision trees and random forests were used to model and predict summer precipitation in Chongqing. Although the models have a good prediction effect, the forecast remains qualitative. Quantitative prediction modeling has not been carried out, and there are obvious limitations in predicting precipitation amounts and the locations of rainfall centers. In subsequent research and operations, the authors will strengthen the research and development of multi-factor collaboration, multi-system integration and multi-model ensemble technology. The various factors affecting summer rainfall in Chongqing will be analyzed further, so as to provide more evidence and clues for improving the precipitation forecasting level in this region.

**Author Contributions:** Conceptualization, C.Z. and X.D.; methodology, C.Z., X.D. and B.X.; software, B.X.; validation, X.D., B.X. and C.Z.; formal analysis, X.D. and C.Z.; investigation, J.W.; resources, X.D.; data curation, J.W.; writing—original draft preparation, X.D., C.Z. and J.W.; writing—review and editing, X.D., C.Z. and J.W.; visualization, B.X.; supervision, J.W.; project administration, X.D.; funding acquisition, C.Z. and X.D. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China (41875111), the Chongqing Natural Science Foundation project (cstc2019jcyj-msxmX0227), the Chongqing Technology Innovation and Application Demonstration general project (cstc2018jscx-msybX0165), and the Intelligent Meteorological Technology Innovation Team project of the Chongqing Meteorological Bureau (ZHCXTD-201804).

**Data Availability:** The data in this study are not available for use by others.

**Conflicts of Interest:** The authors declare no conflict of interest.
