**1. Introduction**

Earthquakes are among the most unpredictable and destructive natural hazards around the world and have caused extremely heavy damage to human life and possessions [1–4]. China is located at the intersection of the Alpine-Himalayan and Circum-Pacific seismic zones, and is subjected to the collision and compression of the Eurasian Plate, Philippine Plate and Indian Plate [5,6]; hence, it has always been prone to earthquakes [7,8]. To date, there have been nine catastrophic earthquakes with more than 200,000 casualties in the world, of which three occurred in China. Since 1949, more than 100 destructive earthquakes have occurred in 22 provinces of China, which have caused 270,000 casualties in total, thereby accounting for 54% of all deaths from natural disasters in this country [5]. Considering the heavy destruction of earthquakes in China's mainland, this study selected it as the study area.

After an earthquake, it is necessary to promptly and efficiently conduct emergency rescue to reduce damage and prevent further increases in the damage degree. An early prediction of the death toll that is caused by the earthquake is an essential reference for the government to determine which grade of emergency response [9] to be launched and what amount of relief supplies to be mobilized to the affected areas [10]. Therefore, rapid and accurate prediction of the number of earthquake casualties is a focus of disaster assessment research.

**Citation:** Li, B.; Gong, A.; Zeng, T.; Bao, W.; Xu, C.; Huang, Z. A Zoning Earthquake Casualty Prediction Model Based on Machine Learning. *Remote Sens.* **2022**, *14*, 30. https:// doi.org/10.3390/rs14010030

Academic Editors: Paolo Mazzanti and Saverio Romeo

Received: 27 November 2021 Accepted: 18 December 2021 Published: 22 December 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

Related studies on seismic casualty prediction focus mainly on two aspects. One aspect is the relationships between relevant factors and the number of earthquake casualties; these studies can be broadly classified into three categories. The studies in the first category explore the impact of seismic parameters on earthquake fatality. Xiao [11] analyzed the relationship of seismic intensity and population density with the mean mortality rate, and proposed an empirical formula for rapidly assessing the death toll, which has been recommended as an effective method for evaluating the mortality rate by *Assessment of Earthquake Disaster Situation in Emergency Period* (a China's national standard). Jaiswal and Wald [12] analyzed the mortality rates of earthquakes with various shaking intensity levels all around the world and proposed a country/region-specific empirical model by using an optimization method to evaluate seismic mortality. The studies in the second category seek to identify the relationship between building vulnerability and earthquake fatality. In the 1980s, commissioned by the Federal Emergency Management Agency (FEMA), the Applied Technology Council (ATC) [13] surveyed and classified buildings in California and proposed the ATC-13 earthquake damage matrix for systematically studying and forecasting possible earthquake losses in this region. Ceferino et al. [14] proposed a probabilistic model for evaluating the number and spatial distribution of casualties due to earthquakes, which improved methods that focused only on a single-building by taking multiple buildings into consideration. The studies in the third category consider the impact of other factors, such as secondary disasters or demographic characteristics, on human loss. Bai et al. [15] scientifically assessed the possible casualties that were caused by secondary disasters and developed a logical regression model for predicting the death toll caused by landslides in the 2014 Yunnan Ludian *MS* 6.5 earthquake. Shapira et al. [16] integrated risk factors that are related to population characteristics (age, gender, physical disability and socioeconomic status) and proposed a model on the basis of the widely used loss estimation model HAZUS.

Other studies focus on enhancing the accuracy of prediction models by improving models or proposing new methods [17,18]. Karimzadeh et al. [19] presented a GIS-oriented procedure in combination with geo-related parameters for identifying the destruction in earthquake-stricken areas and evaluated the seismic loss based on damage functions and relational analyses. Feng et al. [20] regarded building damage as a major cause of earthquake deaths, and used high-resolution satellite imagery to detect building damage in disaster areas. They developed a model for estimating the mortality rate due to an earthquake based on remote sensing and a geographical information system. To solve the problems in the evaluation systems (low precision, long time consumption and poor stability), Zhang [21] proposed a seismic disaster casualty assessment system based on mobile communication big data. Considering that seismic data has the characteristics of small scale, nonlinearity and high dimensionality, many scholars have applied machine learning methods, such as support vector machine (SVM), artificial neural network (ANN), and random forest (RF), to earthquake casualty prediction models in recent years. Xing et al. [22] improved SVM with a robust loss function and used it to construct a robust wavelet earthquake casualty prediction model. Gul and Guneri [23] used earthquake magnitude, occurrence time, and population density as input parameters and built a model for earthquake casualty prediction based on the theory of ANN. Jia et al. [24] used the RF model to compare the importance of features affecting the number of earthquake casualties and proposed a deep learning model for casualty prediction.

According to the literature review above, relatively complete earthquake casualty prediction methodologies have been presented by researchers from various aspects, which provide references for feature selection and model construction in our study. However, an analysis of the previous studies on earthquake casualty prediction reveals the following shortcomings: (1) many prediction methods, especially those that utilize empirical functions, can only be implemented with abundant historical seismic data, which makes it difficult to obtain reliable prediction results when a limited quantity of data are available; (2) some scholars simply considered one earthquake as the case and used a small number of

samples to predict the death toll, whose achievements may be difficult to apply and deploy due to the under-representativeness of predictors and methods; and (3) most studies simply focused on the statistical relations between influential features and earthquake casualties, which led to inadequate representativeness and lack of a theoretical basis for the generality of such prediction models.

Based on the above observations, this study aimed to (1) evaluate the importance of influential features on seismic fatality, study the regional variations in natural and human geographical environments, and propose a spatial division approach for dividing the study area into three degrees of risk zones; (2) improve the support vector regression (SVR) model with reasonable input factors and the best model parameters for all risk zones; and (3) evaluate the performance of the proposed zoning model through experiments.

The remainder of this paper is structured as follows. Section 2 introduces the geographical and seismic background of the study area and describes the data and methodology that are used in this study. Section 3 presents the process and result of importance assessment and proposes the approach of spatial division. Section 4 derives the SVR algorithm in detail and presents the flow of the data processing and model construction. Section 5 presents the experimental results of the proposed method. Section 6 discusses the results and compares them with those of other models. The conclusions of this study are contained in Section 7.

#### **2. Materials and Methods**

#### *2.1. Study Area*

China's mainland is located at the intersection of the Alpine-Himalayan and Circum-Pacific seismic zones, where destructive earthquakes occur frequently [25]. Seismicity in China's mainland is characterized by high frequency, wide distribution, great intensity, shallow seismic focus, and clear regional differences. Most earthquakes in this area are shallow focus earthquakes that occurred within the continental crust, whose principal type are strike-slip type [26]. Based on statistical data from the Earthquake Science Knowledge Service System (http://earthquake.ckcest.cn/earthquake\_n/dzml/ch5.html, accessed on 15 July 2021), we developed a chart of the spatial distribution of historical earthquakes in China's mainland. Figure 1 shows the positions of plates and all earthquakes over *MS* 4.0 that have occurred in China's mainland since 1950. These earthquakes are widely distributed in China's mainland and the spatial pattern of seismic activities in this area is featured by strong activities in the west and weak activities in the east.

With an area of 9.6 million square kilometers (including Taiwan Province), China has diverse natural and human geographical environments that differ in terms of climates, landforms and geological conditions in China; hence, it is difficult to build a single seismic casualty prediction model that is suitable for the whole area. Seismic destructive effects in this vast area are obviously regional. Figure 2 shows the distribution of population and historical earthquakes in China's mainland. The frequency of earthquakes and life losses caused by these disasters are roughly bounded by a population dividing line called the Hu Line [27]. To the east of the Hu Line, earthquakes have caused lager death tolls than those to the west of this boundary, although high seismicity has been observed in the west. Since 1949, 19 provinces in China's mainland suffered deaths due to earthquakes, among which Hebei, Sichuan, and Yunnan Provinces suffered the most life loss events, accounting for more than 90% of all casualties [28].

**Figure 1.** Historical earthquakes and plate distribution in China's mainland; nine-dotted line is the boundary of China's territory in the South China sea.

**Figure 2.** Historical earthquakes and population distribution in China's mainland.

#### *2.2. Materials*

The data that were used in this study included a geological fault dataset, a population dataset and an earthquake case dataset. This study trained and verified the proposed prediction model using the earthquake case dataset, which was also used to evaluate the importance of factors affecting seismic fatality. Geological fault and population datasets were used to divide the study area into defined risk zones based on regional differences.

#### 2.2.1. Earthquake Case Dataset

The majority of the earthquake cases were collected from the Earthquake Science Knowledge Service System (http://earthquake.ckcest.cn/featured\_resources/disaster\_ show.html, accessed on 20 July 2021), which includes 479 records of earthquakes over *MS* 4.0 that have occurred in China's mainland since 1950. We deleted cases without deaths, corrected and supplemented the dataset with relevant literature and reports [29–32], and finally selected a total of 152 seismic cases with death registers in China's mainland. The original earthquake case dataset only had attributes such as location, occurrence date, magnitude, focal depth and death toll. Because information about historical earthquakes is very limited and difficult to acquire, a large part of the data mining process was devoted to collecting and supplementing relevant attributes. We complemented the attributes of earthquakes, including epicenter intensity, aftershock, landform, climatic condition, secondary disaster, collapsed buildings and rescue capability, from their disaster situation evaluation reports and relevant literature [24]. The attributes of occurrence time and day were converted from the occurrence date. We calculated the linear density of strata faults in ArcGIS software, and used the statistical analysis tool in ArcGIS to acquire the earthquake attribute of geological fault density. The attributes of population density and the Gross domestic product (GDP) were collected from statistical yearbooks of provinces where earthquakes occurred. GDP is a monetary measure of the market value of all the final goods and services produced in a specific time period. The data we collected is per capital GDP, which is the ratio of GDP to the total population of the earth-quake-stricken region. Detailed information about each attribute in the earthquake case dataset is provided in Table 1.


**Table 1.** Specification of attributes in the earthquake case dataset.

To describe the data distribution characteristics of earthquake cases, we divided their numbers of casualties into 6 categories: 0–9, 10–99, 100–999, 1000–9999, 10,000–99,999, and ≥100,000. Then, we calculated the piecewise frequency statistics for each category and plotted a statistical chart, which is shown in Figure 3. As shown in this graph, the death tolls of most earthquakes in the dataset were within the ranges of less than 10, 10–99 and 100–999. Strong earthquakes with many casualties occurred with lower frequency; hence, this study focuses on accurately predicting the death toll for earthquakes with less than 1000 casualties.

**Figure 3.** Piecewise frequency statistics of earthquake casualties.

In the construction process of the machine learning model, earthquake samples with many casualties will exert a significant impact on the performance of the prediction model. To evaluate the influence of samples with great values, we conducted an experiment to compare the prediction performance between two data groups: Group A and Group B. Group A was the dataset including all the 152 seismic cases with 1000 casualties or more. Group B was the dataset excluding samples whose numbers of casualties were more than 1000. We took Group A as the training dataset and input it into SVR model, and used the 10-fold cross-validation method to evaluate its prediction performance. The evaluation indicators employed in this experiment were root mean square error (RMSE) and mean absolute error (MeaAE), which are described in detail in Section 6.1. The same experiment was also conducted in Group B. We calculated the average RMSE and MeaAE values for the two groups. The result showed that the RMSE and MeaAE of Group A were 6579.29 and 2346.96, respectively. By contrast, the RMSE and MeaAE of Group B were 48.27 and 40.41 respectively, which means Group B shows significantly better prediction performance due to the exclusion of extreme value samples.

Considering that the devastating earthquakes with more than 1000 casualties occur extremely unfrequently, and their disaster mechanisms are much more complicated, the study focuses on accurately predicting the death toll for earthquakes with less than 1000 casualties. Therefore, we removed cases with more than 1000 casualties in order to avoid the influence of great values. A total of 143 seismic cases with death registers were finally selected. The procedure of dataset division is as follows. (1) In Section 3, we proposes a spatial division method and divides the study area into three groups: high, moderate and low risk zones. Based on the result of spatial division, those selected cases were divided into three parts, including 49 cases in low risk areas, 13 in moderate risk areas, and 81 in high risk areas. (2) To evaluate the prediction accuracy of the Z-SVR model for three degrees of risk zones, we divided the dataset into training and testing datasets. For earthquake cases in each degree of risk zones, we randomly extracted 1/5 of them as the testing dataset, and the remainder was divided into the training dataset. We finally extracted 10 cases in low risk zones, 3 in moderate risk zones, and 17 in high risk zones as the testing dataset

to evaluate the performance of the seismic casualty prediction model. The remainder was used as training dataset for building Z-SVR model. Table 2 presents the division of sample dataset. Figure 4 shows the spatial distributions of historical cases.

**Table 2.** Numbers of training and testing samples in the defined risk zones.


**Figure 4.** Spatial distributions of the earthquake case dataset: (**a**) Training samples; (**b**) testing samples.

### 2.2.2. Geological Fault Dataset

We collected the geological fault dataset from the China Earthquake Data Center (http://datashare.igl.earthquake.cn/map/ActiveFault/introFault.html, accessed on 24 July 2021). It provides the spatial distribution of strata faults in China; the data are in vector format and can be used for spatial analysis in ArcGIS software. This dataset includes 1966 fault segments. For 456 of these segments, detailed parameters such as age, orientation and sliding rate are provided; for 664, only the name and number are specified; for 846, only graphical features are provided, without any attributes. Since the coordinate system of the dataset is the Krassovsky ellipsoid with the Albers projection, we used the projection raster tool in ArcGIS to convert it into the WGS 1984 to ensure the consistency of the spatial reference.
