2.2.3. Population Dataset

The population dataset was obtained from WorldPop (https://www.worldpop.org/, accessed on 28 July 2021). It describes the spatial distribution of the population at a spatial resolution of 100 m, in units of people per pixel, with country totals adjusted to match United Nations national population estimates. The dataset is in raster format, and the value of each pixel is the total population within that grid cell. Because the samples in the earthquake case dataset span a long period, and population data for a single year cannot reflect demographic change, we collected population records for China's mainland at five-year intervals from 2000 to 2020 (2000, 2005, 2010, 2015 and 2020) to capture population change over the whole period.
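As a minimal sketch of how per-pixel counts of this kind can be turned into a population density, the example below aggregates a synthetic raster into 1 km cells. The function name `density_per_km2`, the 10 x 10 aggregation block and the treatment of each pixel as exactly 100 m on a side are assumptions made for illustration; the real product is distributed on a geographic grid, and no-data pixels are assumed to have been set to zero beforehand.

```python
import numpy as np

def density_per_km2(counts, block=10, pixel_km=0.1):
    """Aggregate people-per-pixel counts into block x block cells (people per km^2)."""
    rows, cols = counts.shape
    # Trim the edges so the grid divides evenly into blocks.
    rows -= rows % block
    cols -= cols % block
    trimmed = counts[:rows, :cols]
    # Counts are additive, so sum them inside each block before dividing by area.
    cell_totals = trimmed.reshape(
        rows // block, block, cols // block, block
    ).sum(axis=(1, 3))
    cell_area_km2 = (block * pixel_km) ** 2
    return cell_totals / cell_area_km2

# Synthetic 20 x 20 raster with 2 people in every pixel (not real WorldPop data).
demo = np.full((20, 20), 2.0)
density = density_per_km2(demo)
```

Summing counts before dividing by area keeps the total population unchanged, which averaging per-pixel densities over an irregular mask would not guarantee.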

A methodological flowchart of the investigation is shown in Figure 5.

**Figure 5.** Framework of the Z-SVR model.

Seismic fatality is the combined result of diverse factors, and whether a factor has a decisive impact on earthquake casualties is an essential question for the feature selection of prediction models [33]. Therefore, before constructing a prediction model for earthquake casualties, it is crucial to establish a reasonable index system and analyze the importance of the relevant indicators. Based on regional disaster system theory, this study established an evaluation index system of 14 major features that affect earthquake fatality. We used the earthquake case dataset and the random forest model to assess the importance weights of the features, and the resulting ranking served as an important reference for the feature selection of the prediction model.

Because regions differ, earthquakes with the same ground motion parameters can cause different numbers of casualties. In earthquake cases with the same seismicity, the diversity of disaster-formative environments and disaster-affected bodies reflects the differences among regions [34]. Due to the vast area of China's mainland, it is difficult to build a universal prediction model that is suitable for all regions. To enhance the accuracy of earthquake disaster assessment in emergency periods, it is effective to divide the study area into risk zones based on regional differences and construct a model that performs well for each risk zone. Based on the results of the importance assessment and feature selection, geological fault density and population density are the most important features of disaster-formative environments and disaster-affected bodies, respectively. Therefore, we chose these two features, which have relatively high importance weights, as representative factors for developing a partition standard and dividing the study area into the defined grades of risk zones. The accuracy and applicability of the earthquake casualty prediction approach can be improved by building different submodels for areas with different regional characteristics.
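The partitioning idea can be illustrated with a short sketch that grades each location by its fault density and population density. The class breaks in `FAULT_BREAKS` and `POP_BREAKS`, the units, and the rule of adding the two class indices are all hypothetical choices made for this example; they are not the partition standard used in this study.

```python
import numpy as np

# Hypothetical class breaks, chosen only for illustration.
FAULT_BREAKS = [0.05, 0.15]   # fault length per unit area (illustrative units)
POP_BREAKS = [50.0, 400.0]    # people per km^2 (illustrative values)

def risk_zone(fault_density, pop_density):
    """Return an integer grade from 0 (both low) to 4 (both high)."""
    fault_class = np.digitize(fault_density, FAULT_BREAKS)  # 0, 1 or 2
    pop_class = np.digitize(pop_density, POP_BREAKS)        # 0, 1 or 2
    # Assumption: the two class indices are simply added to form the grade.
    return fault_class + pop_class

zones = risk_zone(
    np.array([0.01, 0.10, 0.30]),
    np.array([10.0, 100.0, 900.0]),
)
```

Any monotone combination of the two classes would serve the same purpose; the sum is used here only because it is the simplest one to read.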

As an extension of the support vector machine (SVM) for solving regression problems, support vector regression (SVR) has attracted much attention in the field of machine learning and has displayed strong predictive ability in mortality evaluation. Compared with other machine learning algorithms, SVR can reach the optimal solution with a small number of samples and largely avoids problems such as overfitting and local extrema, which makes its generalization ability and performance stand out [35]. However, as a machine learning method that is based on historical statistics, the SVR model may have difficulty accurately predicting casualties from earthquakes that occur in different regions of a study area, especially one with a vast extent and diverse environments. Therefore, based on the characteristics of SVR and the regional differences in the study area, we constructed a zoning SVR model (Z-SVR) for the various regions in the study area, for which the optimal model parameters for all risk zones were identified using training samples from the earthquake case dataset.
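A minimal sketch of the zoning idea, assuming scikit-learn, is to fit one tuned SVR per zone and route each sample to the submodel of its zone. The helper names `fit_zoned_svr` and `predict_zoned_svr`, the standardization step, the RBF kernel, the small parameter grid and the three-fold cross-validation are assumptions for illustration, and the data are synthetic rather than drawn from the earthquake case dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def fit_zoned_svr(X, y, zones, param_grid=None):
    """Fit one grid-searched SVR pipeline per zone label."""
    if param_grid is None:
        # Hypothetical grid; real tuning would search a wider range.
        param_grid = {"svr__C": [1.0, 10.0], "svr__gamma": ["scale", 0.1]}
    models = {}
    for zone in np.unique(zones):
        mask = zones == zone
        pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
        search = GridSearchCV(pipeline, param_grid, cv=3)
        search.fit(X[mask], y[mask])
        models[zone] = search
    return models

def predict_zoned_svr(models, X, zones):
    """Predict each sample with the submodel of its own zone."""
    preds = np.full(len(X), np.nan)  # stays NaN for zones without a submodel
    for zone, model in models.items():
        mask = zones == zone
        if mask.any():
            preds[mask] = model.predict(X[mask])
    return preds

# Synthetic data: 3 zones, 20 samples each, 4 input features.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
zones = np.repeat([0, 1, 2], 20)
y = 2.0 * X[:, 0] + zones + 0.1 * rng.normal(size=60)

models = fit_zoned_svr(X, y, zones)
preds = predict_zoned_svr(models, X, zones)
```

Tuning each zone separately lets the regularization strength and kernel width differ between zones, which is the practical benefit of training submodels instead of a single global model.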

### **3. Spatial Division**

#### *3.1. Importance Assessment*

According to regional disaster system theory, a seismic disaster is a complex mechanism that results from the interactions between disaster-inducing factors, disaster-affected bodies and disaster-formative environments [36]. Among them, disaster-inducing factors, such as seismic magnitude and focal depth, are the sufficient conditions for disaster occurrence; disaster-affected bodies, such as population distribution and building destruction, represent the necessary conditions for disaster resilience; and disaster-formative environments, such as climatic conditions and secondary disasters, provide a natural and human geological background that affects disaster-inducing factors and disaster-affected bodies [17]. The loss due to a disaster is attributed to the combined effects of these three factors; therefore, to screen the prediction indicators, we constructed an evaluation index system on the basis of regional disaster system theory, which is presented in Table 3.


**Table 3.** Evaluation index system of features that influence earthquake fatality.

Determining the importance weights of all features in the evaluation index system is the quantitative task in importance assessment. Although traditional linear models show good performance in the importance assessment of factors that affect earthquake fatality, their results can be easily disturbed by the uncertainty and fuzziness of the input data [37]. An integrated ensemble model is an effective approach for mitigating this problem and improving the accuracy and generalization performance of the evaluation method [38], as demonstrated by previous studies [39]. Random forest (RF) is an effective integrated ensemble model that uses random binary decision trees for classification or regression [39]. As an extension of the bagging method, this algorithm constructs multiple independent estimators and determines the output by averaging or majority voting. This approach enhances the precision and stability of the prediction model, reduces the sensitivity of the model to noise and outliers, and mitigates problems such as overfitting [40]. In contrast to other machine learning methods, the RF model can quantify the importance of a prediction indicator by measuring the increase in prediction error when the values of that variable are randomly permuted in the out-of-bag observations of each tree.

We chose 7 indicators of disaster-inducing factors, 4 of disaster-affected bodies and 3 of disaster-formative environments as the input parameters of the RF model to evaluate their importance to earthquake fatality. The values of the input parameters were extracted from the earthquake case dataset. We utilized the machine learning package scikit-learn of the Python programming language to construct the RF model. The "feature\_importances\_" is an attribute of the RF model in the scikit-learn package. The importance of a feature is computed as the normalized total reduction of the criterion brought by that feature. The procedure is summarized as follows:
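As an illustrative sketch only, and not the authors' own procedure, the example below shows how importance weights of this kind can be read from a scikit-learn random forest. The feature names, the synthetic data and the hyperparameters are assumptions. It reports the impurity-based `feature_importances_` attribute and, for comparison, `sklearn.inspection.permutation_importance`, which permutes the supplied data rather than the out-of-bag observations of each tree.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic stand-in for the case dataset (hypothetical feature names).
rng = np.random.default_rng(0)
feature_names = ["magnitude", "population_density", "fault_density", "noise"]
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + 0.1 * rng.normal(size=200)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X, y)

# Impurity-based weights: normalized total reduction of the split criterion.
impurity = forest.feature_importances_

# Permutation-based weights: drop in score when one column is shuffled.
perm = permutation_importance(forest, X, y, n_repeats=10, random_state=0)

# Rank features from most to least important by the impurity-based weights.
ranking = [feature_names[i] for i in np.argsort(impurity)[::-1]]
```

The two measures usually agree on the leading features but can differ for correlated inputs, so comparing them is a cheap consistency check on a ranking.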


The ranking of all factors according to the importance weights from low to high is shown in Figure 6.

**Figure 6.** Importance weights of indicators on the index levels.

Based on the results of the importance assessment of influential features, magnitude, collapsed buildings, epicenter intensity, population density, geological fault density and GDP are the major factors that affect seismic fatality. Magnitude and epicenter intensity are the two most important parameters for describing the severity of an earthquake and exert substantial influence on seismic fatality; however, these two features are strongly correlated. To avoid information redundancy, we selected magnitude, which has the greater importance weight, as an input parameter of the Z-SVR model. Building destruction is the direct cause of earthquake injuries and deaths [41], and the primary task of emergency rescue is to search for people who are buried in collapsed structures. However, the aim of the proposed model is to rapidly predict the possible casualties of an earthquake that has just occurred, which requires an extremely fast response. Identifying building damage and counting the collapsed buildings takes time, so this feature was not used as an input parameter. Population density is the most important feature among the disaster-affected bodies; since human beings are the major victims of earthquakes, it is important to choose this feature as one of the prediction indicators. Geological fault density is the most important factor among the disaster-formative environments, where the density of fault lines can be used to quantitatively analyze regional differentiation and therefore merits consideration. GDP is a comprehensive indicator whose effect on earthquake casualties is interdependent with that of population density; therefore, it is worthwhile to introduce this factor as an input parameter and consider its combined effect with population density to ensure the stability and accuracy of the prediction results.
In conclusion, based on the results of the importance assessment and the principles of rapid evaluation and avoiding information redundancy, we finally selected magnitude, population density, geological fault density and GDP as the input parameters for the construction of the Z-SVR model, among which geological fault density and population density were also applied to divide the study area into risk zones.
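The selection principles described above (prefer the more important feature of a strongly correlated pair, and skip features that are not available immediately after an event) can be sketched as a simple greedy filter. The function `select_features`, the correlation threshold of 0.8, the importance values and the synthetic data are all hypothetical and serve only to illustrate the reasoning.

```python
import numpy as np

def select_features(X, names, importances, corr_threshold=0.8, excluded=()):
    """Greedily keep important features that are not redundant or excluded."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    order = np.argsort(importances)[::-1]  # most important first
    kept = []
    for idx in order:
        if names[idx] in excluded:
            continue  # e.g. not available quickly enough after an earthquake
        # Keep the feature only if it is weakly correlated with all kept ones.
        if all(corr[idx, k] < corr_threshold for k in kept):
            kept.append(idx)
    return [names[i] for i in kept]

# Synthetic data: intensity is almost a copy of magnitude.
rng = np.random.default_rng(0)
magnitude = rng.normal(size=100)
intensity = magnitude + 0.1 * rng.normal(size=100)
population = rng.normal(size=100)
collapsed = rng.normal(size=100)
X = np.column_stack([magnitude, intensity, population, collapsed])

names = ["magnitude", "epicenter_intensity", "population_density", "collapsed_buildings"]
importances = np.array([0.35, 0.25, 0.15, 0.30])  # made-up weights

selected = select_features(
    X, names, importances, excluded=("collapsed_buildings",)
)
```

In this toy case the filter keeps magnitude, drops epicenter intensity as redundant, skips collapsed buildings as unavailable, and keeps population density.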
