3.1. A Bi-Dimensional Dataset Classified Using a Linear Threshold
In this section, the proposed classification method is assessed by comparing its results with ground-truth data on damaged buildings provided by the Japanese Ministry of Land, Infrastructure, Transport and Tourism (MLIT) [44]. The experiments were conducted on the same dataset used in [41]. The dataset was prepared from two TerraSAR-X images of the coastal area of Miyagi Prefecture (Figure 2b) and a geocoded building footprint inventory. The images were recorded on 12 October 2010 and 13 March 2011; thus, this dataset can be used to extract the infrastructural damage caused by the Great East Japan Earthquake and Tsunami, which occurred on 11 March 2011. The dataset contains N = 31,262 samples, each representing a building located in the affected area, with two features per sample. The first feature is the average difference in backscattering (d) between the two TerraSAR-X images within a rectangular box that contains the building's footprint. To ensure the inclusion of layover and shadowing effects, a minimum distance of 5 m was established between the edges of the rectangular box and the building footprint. The second feature is the correlation coefficient (r) between the pixels of the two images located inside the same rectangular box. Thus, each sample vector has the form x = (d, r).
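As an illustration, the two features can be sketched as follows for a pair of co-registered image patches. The function name and the array inputs are assumptions for this sketch; the paper's exact preprocessing chain is not reproduced here.

```python
import numpy as np

def change_features(pre_box, post_box):
    """Sketch of the two per-building features: the mean backscatter
    difference d and the correlation coefficient r, computed over the
    pixels of the rectangular box around the footprint.
    pre_box, post_box: co-registered 2-D arrays of backscatter values."""
    diff = post_box - pre_box
    d = float(diff.mean())  # average difference in backscattering
    # Pearson correlation coefficient between the two patches
    r = float(np.corrcoef(pre_box.ravel(), post_box.ravel())[0, 1])
    return np.array([d, r])  # sample vector x = (d, r)
```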
Figure 3a shows a scatter plot of the bi-dimensional dataset, in which the colored marks represent the sample density.
For the estimation of the probability of collapse for each sample (
), the spatial distribution of the hazard, i.e., the inundation depth (
Figure 2c), and a fragility curve for building collapse (
Figure 3b) were used. Two cases were analyzed in this study. In the first case, the fragility curve proposed by Koshimura et al. [
26] was employed. In the second case, the fragility curve proposed by Suppasri et al. [
28] was used. Based on the geolocation of each sample
, its inundation depth was extracted from
Figure 2c. Then,
was calculated using one of the fragility functions (
Figure 3b).
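As a sketch, a fragility curve of the commonly used lognormal-CDF form can be evaluated as follows. The parameters mu and sigma below are placeholders for illustration only, not the values of the curves in [26] or [28].

```python
import math

def collapse_probability(depth_m, mu=0.6, sigma=0.4):
    """Lognormal-CDF fragility curve: P(collapse | inundation depth).
    mu and sigma are placeholder parameters (assumptions)."""
    if depth_m <= 0.0:
        return 0.0
    z = (math.log(depth_m) - mu) / sigma
    # standard normal CDF evaluated via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```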
For the calibration of the discriminant function using the IHF classification method, it is necessary to use samples that exhibit a uniform distribution with respect to the hazard. Thus, the samples were grouped according to their inundation depth into ranges corresponding to multiples of 50 cm. Then, from each of the first 14 groups, 373 samples were randomly extracted, so that a total of 5222 samples was used for the discriminant function calibration. The discriminant functions obtained for the first case (i.e., with the fragility function of Koshimura et al. [26]) are displayed in Figure 4. The entire dataset (31,262 samples) was grouped in accordance with the damage state reported in the survey conducted by MLIT [44]. Seven damage state (DS) levels, from DS0 (undamaged) to DS6 (washed away), were defined in this field survey. Figure 4a–g show the scatter plots of the samples separated by DS, together with the resulting discriminant functions. As before, the colored marks denote the density of the samples. To investigate the effects of the random sampling, the IHF method was run several times; the differences among the results are barely noticeable, as seen from Figure 4a–g, where the discriminant functions from four different runs are shown. Thus, it was concluded that the effect of the random sampling was insignificant. The gradient descent algorithm was employed to solve Equation (4), with a learning rate of 0.05. For every test, the initial weight vector was composed of zeros. Figure 4h shows the evolution of the solution over the successive iterations; convergence was reached after approximately 3000 iterations. From a qualitative evaluation, it is evident that the discriminant function separates the majority of the DS6 samples from the rest. For a quantitative evaluation, Table 1 presents the confusion matrix. To calculate the overall accuracy (OA), all samples with damage states DS0–DS5 were merged. The results indicate an OA of 82.2% and a Cohen's kappa coefficient of 0.62. The producer accuracies (PAs) for samples with different DSs were also calculated. A high PA is observed for samples with damage states DS0–DS3; by contrast, the lowest PAs are observed for the DS4–DS5 samples.
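The calibration step described above can be sketched as a plain gradient descent with a learning rate of 0.05 and a zero initial vector. The squared-error cost between a sigmoid of the linear discriminant and the fragility-derived collapse probabilities is a stand-in for this sketch; the paper's actual cost function (Equation (4)) is not reproduced here.

```python
import numpy as np

def calibrate(X, p, lr=0.05, iters=3000):
    """Gradient-descent calibration sketch. X: (N, 2) feature matrix
    (d, r per sample); p: (N,) collapse probabilities from the fragility
    curve, used as soft targets. Minimizes 0.5 * mean((s - p)**2) with
    s = sigmoid(g(x)) for the linear discriminant g(x) = w0 + w1*d + w2*r."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend bias term
    w = np.zeros(Xb.shape[1])                       # initial vector of zeros
    for _ in range(iters):
        s = 1.0 / (1.0 + np.exp(-Xb @ w))           # sigmoid of g(x)
        grad = Xb.T @ ((s - p) * s * (1.0 - s)) / len(p)
        w -= lr * grad
    return w
```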
Similarly, the results obtained using the fragility function of Suppasri et al. [28] are displayed in Figure 5 and Table 2. The same parameters and initial conditions as in the previous case were employed. Again, the effect of the random sampling was negligible; however, slightly faster convergence was observed. An OA of 87.5% and a Cohen's kappa of 0.69 were achieved. In addition, high PA values, with a minimum of 78.3% (for DS5), were achieved.
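The OA and Cohen's kappa values quoted above follow from the confusion matrix in the standard way, e.g.:

```python
def oa_and_kappa(cm):
    """Overall accuracy and Cohen's kappa from a square confusion
    matrix given as nested lists (rows: reference, cols: predicted)."""
    n = sum(sum(row) for row in cm)
    diag = sum(cm[i][i] for i in range(len(cm)))
    oa = diag / n
    # chance agreement from the row and column marginals
    pe = sum(sum(cm[i]) * sum(row[i] for row in cm)
             for i in range(len(cm))) / n ** 2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, kappa
```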
3.2. Generalization of the Discriminant Function
To show that the IHF method retains the benefits of a traditional machine learning technique in terms of independence of dimensionality and discriminant function shape, two additional cases are presented. A modification of the discriminant function to consider non-linear terms can be easily implemented; for instance, quadratic terms can be added to the discriminant function of Equations (1) and (2).
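Concretely, the same linear calibration can produce a quadratic threshold in (d, r) if each sample vector is expanded with quadratic terms first; the exact terms of Equations (1) and (2) are not reproduced here, so the expansion below is illustrative:

```python
def quadratic_expand(d, r):
    """Expand the two original features with quadratic terms, so that a
    linear discriminant in the expanded space,
    g(x) = w0 + w1*d + w2*r + w3*d**2 + w4*r**2 + w5*d*r,
    is a quadratic (non-linear) threshold in the original (d, r) plane."""
    return [d, r, d * d, r * r, d * r]
```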
The resulting non-linear threshold is depicted in Figure 6, in which all samples with damage states DS0–DS5 are presented as "non-collapsed buildings". The confusion matrices are shown in Table 3 and Table 4, from which the same level of accuracy as in the previous results is observed.
In a similar way, the number of features considered can be increased. Figure 7 presents the parallel coordinate plot of a dataset composed of seven features: the correlation coefficient (r), the mean difference (d), the median difference, the mode of the differences, the standard deviation of the differences, the maximum of the differences and the minimum of the differences. The additional five features were calculated using the same TerraSAR-X images and the same rectangular boxes for the samples. With these features, the discriminant function is extended with the corresponding additional terms. In Figure 7, the red marks correspond to the samples classified as collapsed buildings; likewise, the blue marks denote the samples classified as non-collapsed buildings. Note that all features have been normalized with respect to their maximum absolute value. The confusion matrix is shown in Table 5.
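The seven-feature vector, including the max-absolute-value normalization, can be sketched as follows. The histogram-based mode (and its bin count) is an assumption for this sketch, since a mode of continuous-valued differences requires some form of binning.

```python
import numpy as np

def seven_features(pre_box, post_box, bins=32):
    """Seven per-building features: correlation coefficient r, plus mean,
    median, mode, standard deviation, maximum and minimum of the pixel
    differences, normalized by the maximum absolute value."""
    diff = (post_box - pre_box).ravel()
    counts, edges = np.histogram(diff, bins=bins)
    k = int(np.argmax(counts))
    mode = 0.5 * (edges[k] + edges[k + 1])  # centre of the fullest bin
    r = np.corrcoef(pre_box.ravel(), post_box.ravel())[0, 1]
    x = np.array([r, diff.mean(), np.median(diff), mode,
                  diff.std(), diff.max(), diff.min()])
    return x / np.max(np.abs(x))  # normalize by the maximum absolute value
```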
3.3. Discussion of the Case Study
In this section, additional comments regarding the results are addressed. In the experimental analysis, the effect of the fragility curve was observed by comparing the results obtained using the fragility curve of Koshimura et al. [26] with those obtained using that of Suppasri et al. [28]. The better performance of the fragility curve of Suppasri et al. [28] is attributed to the fact that it was created specifically for Japanese buildings, whereas Koshimura et al. [26] proposed fragility curves for buildings in Indonesia. Thus, to achieve the best performance, it is necessary to use a fragility function designed for the specific target area. This issue might represent an important pitfall. However, at present, there is a significant number of publications concerning building damage functions for specific regions. For instance, tsunami fragility functions have been proposed for the building stocks of Sri Lanka [45], Chile [27], Thailand [46] and Samoa [47,48]. Likewise, earthquake fragility functions have been reported for Japan [24] and South America [49]. Another issue is the number of fragility functions used in the case study. According to the building inventory data for the affected area, 23,767 (89.7%) wooden buildings, 1753 (6.6%) steel-frame buildings and 975 (3.7%) reinforced concrete buildings were identified. Furthermore, buildings with fewer than two floors represent 99% of the total. Based on these aggregates, it was assumed that the majority of the affected buildings were wooden buildings; therefore, only one fragility curve was employed in each test. However, in most potential applications, a large variety of building types might be found, and it is essential to point out that a single fragility function may be insufficient.
With respect to the calculation of the features, a region larger than the building footprint was used in order to account for the shadowing and layover effects. As a side effect, some pixels of neighboring buildings were included as well. However, the features r and d are aggregate values over the rectangular box used for their calculation; thus, the inclusion of neighboring pixels should not affect the results significantly. This assumption is confirmed by Figure 4 and Figure 5, in which the collapsed and non-collapsed buildings are grouped in different regions.
Regarding the low accuracy observed for the samples classified as DS5 (Figure 4, Figure 5 and Table 1, Table 2, Table 3, Table 4 and Table 5): according to MLIT, a building was classified as DS5 when its main structure had been compromised. Thus, it is very likely that the DS5 samples include both collapsed and non-collapsed buildings. However, in the accuracy evaluation, all DS5 samples were assumed to be non-collapsed buildings. This assumption might be the reason for the low accuracy on the DS5 samples, which, in turn, was the main reason for the low user accuracy (UA). Another source of misclassification is the presence of speckle noise in the SAR images. Speckle noise is associated with the constructive and/or destructive interference of the electromagnetic waves reflected from different objects within a pixel. The TerraSAR-X images were speckle filtered; however, speckle cannot be removed completely. Its effect is reflected in the features r and d, as can be seen in Figure 4, where some samples of collapsed and non-collapsed buildings overlap.
A comparison with classification methods based on supervised machine learning is desirable here. The results of Wieland et al. [16] and Bai et al. [22] were chosen for this purpose. Both research groups used the same TerraSAR-X images as in the present study. Moreover, the same information from MLIT was used to create the training data. However, the features (i.e., the elements of the feature vector) were calculated following a different procedure. Thus, the comparison here should be considered for reference only; it cannot be concluded that one method is superior to the other. In addition to the images used in our case study, Wieland et al. [16] used three further SAR images: a TerraSAR-X image acquired on 19 June 2011 and two ALOS PALSAR images acquired on 5 October 2010 and 7 April 2011. Wieland et al. [16] performed several tests to assess the effects of the number of images, the number of features, etc. For the present comparison, the results they obtained using the same images as in our study are used. Table 6 shows the evaluation of the classification performance. Here, the user accuracy (UA), the producer accuracy (PA) and a combination of both scores are presented as accuracy measures [50]. The last row (Total) lists the average scores for the collapsed and non-collapsed samples. Our results show the same level of accuracy as those of Wieland et al. [16]. Bai et al. [22] addressed a more challenging problem, namely the detection of damaged buildings based on post-event imagery only. Moreover, three classes were defined: washed-away buildings, collapsed buildings and slightly damaged buildings. Deep neural networks were used for classification. The training samples were based on tiles, to which class labels were allocated in accordance with the label of the majority of the buildings in each tile. An overall accuracy of 74.8% and a Cohen's kappa coefficient of 0.60 were achieved. Once again, our results reach a comparable level of accuracy (Table 1, Table 2, Table 3, Table 4 and Table 5). Before concluding these comparisons, it is worthwhile to recall that the training data in both studies, those of Wieland et al. [16] and Bai et al. [22], were prepared using the information from the MLIT field survey [44], of which the first report was released on 23 August 2011, five months after the disaster event.
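The per-class accuracy measures discussed above can be computed from the confusion matrix as follows. Taking the combined score to be the harmonic mean of UA and PA is an assumption of this sketch.

```python
def class_scores(cm, k):
    """User's accuracy (precision), producer's accuracy (recall) and their
    harmonic mean for class k of a square confusion matrix given as
    nested lists (rows: reference, cols: predicted)."""
    tp = cm[k][k]
    ua = tp / sum(row[k] for row in cm)  # correct / all predicted as class k
    pa = tp / sum(cm[k])                 # correct / all reference class k
    f = 2 * ua * pa / (ua + pa)          # combined score (harmonic mean)
    return ua, pa, f
```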
Regarding run-time performance, the run-time strongly depends on the specific hardware and compiler used for the analysis. For this paper, the IHF method was implemented in the Python programming language, Version 2.7 [51], compiled with Microsoft Visual C++ 2008 for 32-bit systems. The Numerical Python (NumPy) library [52] was used to vectorize the calculations. The classifications reported in this case study were performed on an HP Z240 SFF Workstation (3.70 GHz). The calibration of the linear threshold function from the bi-dimensional dataset presented in Section 3.1 took 8.98 s; the calibration of the non-linear threshold function from the bi-dimensional dataset (Section 3.2) took 9.05 s; and the calibration of the linear threshold function from the seven-dimensional dataset (Section 3.2) took 11.54 s. These run-times are negligible compared with the time required for the acquisition of the SAR imagery. Nevertheless, the run-time can still be reduced significantly, and this will be addressed in a future publication. For instance, Python, being an interpreted language, exhibits lower performance than a compiled language such as C++ or Fortran. Moreover, note that the run-times shown represent the time required for 10,000 iterations (Figure 4h and Figure 5h), which would not be necessary if convergence were monitored within the algorithm. Furthermore, the gradient descent method used for the minimization could be replaced with a more sophisticated algorithm in which, for instance, the learning rate (Equation (6)) is adapted during run-time to achieve faster convergence.
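The convergence monitoring suggested above can be sketched as a stopping criterion on the size of the update step; the tolerance value and function interface below are assumptions for illustration.

```python
import numpy as np

def descend(grad, w0, lr=0.05, tol=1e-8, max_iters=10_000):
    """Gradient descent with a simple convergence check: stop once the
    update step is smaller than tol, instead of always running the full
    10,000 iterations. grad is any callable returning the gradient at w."""
    w = np.asarray(w0, dtype=float)
    for i in range(max_iters):
        step = lr * grad(w)
        w = w - step
        if np.linalg.norm(step) < tol:
            return w, i + 1  # converged early
    return w, max_iters
```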