1. Introduction
Earthquakes, one of nature’s most catastrophic occurrences, can seriously harm a building’s structural integrity [
1]. Building damage must be classified promptly and accurately in order for government rescue and emergency response efforts to be successful. Building damage evaluation benefits greatly from the extensive coverage, high resolution, and excellent timeliness of remote sensing images [
2,
3].
Building extraction and classification from high-resolution, pre-disaster remote sensing images has long been a major area of interest for both industry professionals and academics. Classical techniques for extracting buildings from remote sensing images can be divided into two primary types based on feature extraction: data-driven techniques and model-driven techniques [
4]. Data-driven approaches identify building targets according to specific combination rules, primarily using low-level features found in remote sensing images, such as lines, corners, texture regions, shadows, and height disparities. Model-driven approaches start from a semantic model and prior knowledge of the whole building, using high-level global features to guide processes such as object extraction, image segmentation, spatial-relationship modeling, and contour-curve evolution towards building targets.
As artificial intelligence and computer vision have advanced, deep learning has gained popularity as a method for acquiring high-level features that are both discriminative and representative [
5]. Deep learning is more flexible and capable than conventional techniques since it can directly learn high-level characteristics from raw data [
6]. Convolutional Neural Networks (CNNs) have been extensively utilized in the last few years to recognize damaged buildings [
7].
Some scholars have conducted earthquake-damaged house classification research using deep learning methods. xView2 is a large-scale remote sensing image dataset used for building detection and classification in natural disasters. This dataset provides high-resolution aerial images before and after disasters, classifying damage into four categories, namely no damage, minor damage, major damage, and destroyed. For example, Weber et al. (2020) used UNet on the xView2 dataset to extract the pre-disaster and post-disaster image features of buildings. They located buildings based on pre-disaster images and classified the damage categories of buildings based on post-disaster images. The overall F1 score was a weighted combination of 30% localization F1 and 70% damage classification F1, achieving an overall F1 score of 74.1% [
8]. There are also scholars who divide the disaster level into two categories, namely collapsed and not collapsed. Wang et al. (2021) proposed an improved network, OB-UNet, and built two benchmark datasets (YSH and HTI) for identifying damaged buildings based on post-earthquake images from China and Haiti in 2010, obtaining a mIoU of 66.77% on the YSH dataset and 70.95% on the HTI dataset [
9]. Cui et al. (2022) proposed an improved Swin Transformer method, achieving a mIoU of 88.53% [
10]. The aforementioned methods primarily utilize semantic segmentation. Additionally, some scholars divide post-disaster images into multiple smaller sub-images and classify the damage within each sub-image. In this approach, the entire area covered by each sub-image is assigned a single damage level, and the results are then assembled to obtain an overall assessment of the area's damage. For example, Cooner et al. (2016) used CNNs to classify high-resolution earthquake remote sensing images, dividing the disaster level into two categories, namely damaged and undamaged, to quickly detect damaged buildings, with an accuracy rate of 55% for the 2010 Haiti 7.0 magnitude earthquake [
11]. Ma et al. (2020) combined remote sensing images with block vector data, divided the disaster level into three categories, serious damage, moderate damage, and slight damage, and improved the Inception V3 architecture; the test precision for post-earthquake aerial images of the Yushu earthquake reached 90.07% [
12]. Ji et al. (2020) used a pre-trained VGG model to identify collapsed buildings in pre- and post-earthquake remote sensing images of the 2010 Haiti earthquake, dividing the disaster level into two categories, namely collapsed and not collapsed. The results showed that the adjusted VGGNet model outperformed the original VGGNet model trained from scratch, improving the overall accuracy from 83.38% to 85.19% [
13]. In addition, some scholars have used object detection methods to solve this problem. Gadhave et al. (2023) used YOLOv8 on the xView2 dataset for building detection and classification (four categories) tasks, achieving a best mAP of 58.3% [
14]. Jing et al. (2022) used an improved YOLOv5 on the Yunnan Yangbi dataset for collapsed building detection tasks, reaching a mAP of 90.94% [
15]. Wang et al. (2023) proposed an improved BD-YOLOv5 for collapsed building detection tasks, improving the F1 score from 94.08% to 95.34% [
16].
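The xView2 overall score cited above combines localization and damage-classification F1 scores with a 30%/70% weighting. A minimal sketch, assuming (as we understand the challenge convention) a harmonic mean over the four per-class damage F1 scores; function names are illustrative:

```python
def f1(precision, recall):
    """Standard F1 score from precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def xview2_overall(loc_f1, damage_f1s):
    """Overall score = 0.3 * localization F1 + 0.7 * damage F1,
    where damage F1 is the harmonic mean over the damage classes."""
    n = len(damage_f1s)
    # Clamp to avoid division by zero for a class with F1 = 0.
    harmonic = n / sum(1.0 / max(f, 1e-6) for f in damage_f1s)
    return 0.3 * loc_f1 + 0.7 * harmonic

# Illustrative values for the four damage classes.
print(round(xview2_overall(0.85, [0.9, 0.6, 0.55, 0.8]), 3))  # → 0.734
```

The harmonic mean penalizes a model that neglects any single damage class, which is why a strong localization score alone cannot carry the overall metric.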
In past research, the method of dividing images into smaller patches and classifying each sub-image has been utilized, and each sub-image region is identified as a single damage level. Its advantage lies in achieving higher accuracy. However, in high-resolution Unmanned Aerial Vehicle (UAV) imagery, this method may split a single building across different sub-images, resulting in multiple damage states for a single target. Additionally, for irregularly shaped buildings, this approach may struggle to accurately determine building edges and can only provide a rough estimation of the building’s damage status.
Figure 1b illustrates this effect. The object detection method cannot accurately extract building outlines as it can only predict the location of damaged buildings, treating the entire predicted bounding box as a single instance of the damaged building area (
Figure 1c). When the shape of the building is irregular, this method tends to overestimate the predicted damaged area, which hinders fine-grained assessment after an earthquake. Current deep learning techniques for extracting buildings with varying degrees of damage from high-resolution remote sensing images are generally based on semantic segmentation, which has produced positive results in earlier research. However, in complex geographical contexts and in low-altitude UAV data, where targets are much larger than in satellite imagery, semantic segmentation may lead to edge connections between adjacent buildings. This likely stems from confusion regarding building boundaries, which is detrimental to subsequent edge extraction and contour fitting [
17]. Additionally, semantic segmentation, which classifies each pixel, may result in multiple classification results for different parts of the same target building. In principle, each building instance should only have one state. However, many weaker disasters result in partial damage to buildings, with damaged areas often much smaller than undamaged areas. Undamaged samples still dominate the learning process, causing pixel-level models to segment damaged building instances into several different area categories [
8,
18], as shown in
Figure 1d.
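The mIoU figures quoted for the semantic-segmentation baselines above are the per-class intersection-over-union averaged over classes. A minimal sketch on flattened label maps (the maps and class count here are purely illustrative):

```python
def mean_iou(pred, target, num_classes):
    """Per-class IoU averaged over classes present in either map.
    pred and target are flattened per-pixel class-ID lists."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in zip(pred, target))
        union = sum(p == c or t == c for p, t in zip(pred, target))
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious)

pred = [0, 0, 1, 1, 1, 1]   # flattened prediction map (illustrative)
gt   = [0, 1, 1, 1, 1, 0]   # flattened ground truth
print(round(mean_iou(pred, gt, 2), 3))  # → 0.467
```

Because mIoU is computed per pixel with no notion of instances, the boundary-confusion and split-instance problems described above do not show up directly in the metric, which is part of the motivation for instance-level evaluation.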
Instance segmentation encompasses object detection and semantic segmentation, allowing for the precise identification and differentiation of various building instances within an image. It also accurately delineates the boundaries of each building instance. Mask R-CNN [
19], a trailblazing study in instance segmentation that involves simultaneously predicting segmentation masks and bounding boxes, has made notable progress. Currently, according to the mask segmentation principle, instance segmentation is mainly divided into two categories: single-stage and two-stage. Representative two-stage methods include Mask R-CNN, Cascade Mask R-CNN [
20], and HTC (Hybrid Task Cascade for Instance Segmentation) [
21]. Single-stage methods include YOLACT [
22], CondInst [
23], SOLO [
24], SOLOv2 [
25], YOLOv5-Seg, and YOLOv8-Seg. In terms of real-time performance and detection speed, single-stage methods outperform two-stage methods. For example, Yildirim et al. (2023) applied Mask R-CNN to remote sensing data collected after the Kahramanmaraş earthquake in Türkiye. They categorized the disaster level into two classes, namely intact and collapsed. Experimental results showed that the Mask R-CNN model with a ResNet-50 backbone produced the most accurate results, successfully distinguishing intact and collapsed buildings, with average precisions (AP) of approximately 81% and 69%, respectively [
26]. Zhan et al. (2022) proposed an improved Mask R-CNN for a dataset of the April 2016 earthquakes in the Kyushu region of Japan, categorizing the disaster level into four categories (no damage, slight damage, severe damage, and collapsed). They achieved box AP improvements from 33.2% to 36.5% and mask AP improvements from 33.3% to 37.3% [
27].
In research based on post-earthquake UAV images, instance segmentation stands out for its ability to meet the diverse data requirements of post-earthquake rescue efforts compared to other approaches. Not only does it enable the extraction and classification of buildings at different damage levels from images, but it also allows individual buildings to be counted and different building instances to be distinguished. This approach is better suited to the fine-grained needs of earthquake damage assessment and rescue operations. However, to the best of the authors' knowledge, there has been little research on instance segmentation for building damage categorization [
17].
Based on the research of domestic and international scholars, this paper proposes an improved YOLOv5s-Seg algorithm for extracting buildings at different damage levels from post-earthquake UAV images. This paper’s primary innovations are:
Unlike most existing studies, which use semantic segmentation or object detection methods for building damage classification, this paper adopts instance segmentation to classify the damage level of buildings in post-earthquake UAV imagery, enabling more precise acquisition of building locations and damaged areas.
To address classification errors between adjacent categories of damaged buildings caused by similar texture features, such as between the Intact and Slight categories and between the Severe and Collapse categories, this paper incorporates the MLCA (Mixed Local Channel Attention) mechanism into the backbone. MLCA combines local and global features, as well as channel and spatial feature information, enhancing image feature extraction and thus improving the accuracy of the Intact and Collapse categories.
To enhance the segmentation precision for small buildings, this paper replaces the Neck with that of ASF-YOLO (A Novel YOLO Model with Attentional Scale Sequence Fusion), which strengthens the feature information of small targets, thereby improving the model's detection and segmentation precision.
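To make the attention idea concrete, the sketch below shows only the generic channel-attention principle that MLCA builds on: pool each channel to a descriptor, squash it through a sigmoid, and rescale the channel by the resulting weight. This is a minimal illustration, not the published MLCA implementation; the feature map, shapes, and values are purely illustrative.

```python
import math

def sigmoid(x):
    """Logistic function used as the attention gate."""
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(feat):
    """Toy channel attention over a C x H x W feature map
    represented as nested lists (illustrative only, not MLCA itself)."""
    out = []
    for channel in feat:
        # Global average pooling: one descriptor per channel.
        flat = [v for row in channel for v in row]
        weight = sigmoid(sum(flat) / len(flat))
        # Rescale every value in the channel by its attention weight.
        out.append([[v * weight for v in row] for row in channel])
    return out

# Two 2x2 channels: an "active" one and an all-zero one.
feat = [[[1.0, 2.0], [3.0, 4.0]],
        [[0.0, 0.0], [0.0, 0.0]]]
out = channel_attention(feat)
```

In the real MLCA module, local pooling windows and spatial information are mixed in as well; this sketch only conveys the descriptor-gate-rescale pattern common to channel-attention designs.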
3. Results
YOLOv5-Seg offers training models of five sizes, n, s, m, l, and x, to satisfy the varied requirements of different application scenarios (full details are shown in
Table 5). The smallest and fastest model is YOLOv5n-Seg. YOLOv5x-Seg, on the other hand, has the most parameters of all the models and provides the best accuracy, but at a slower inference speed.
Taking into account the demands of real-time tasks and device deployment for model parameters, this study especially focuses on YOLOv5s-Seg.
The comparative experimental results are shown in
Table 6 and
Figure 11. From the results, it is evident that the improved YOLOv5s-Seg model has clear advantages in mAP evaluation metrics, model parameters, and inference speed compared to the other algorithms. In terms of the mAP metric, both for bounding boxes (box mAP) and for segmentation (mask mAP), the improved YOLOv5s-Seg model performs better than the alternative methods. Most notably, the box mAP of YOLOv5s-Seg reaches 68.84%, which is 0.56% higher than the best comparative algorithm, YOLOv9-Seg, and the mask mAP is 0.15% higher. These results show that the modified approach considerably raises the model's accuracy in identifying and classifying post-earthquake building damage. Furthermore, the improved YOLOv5s-Seg model shows a notable edge over YOLOv7-Seg [
41] and the most recent YOLOv9-Seg [
42] when comparing the quantity of model parameters.
As can be seen in
Figure 11, the model's parameter count is noticeably smaller than that of other models at a comparable level of precision. The parameter count of YOLOv5s-Seg is 7.66 M, less than half of YOLOv7-Seg's (37.86 M) and YOLOv9-Seg's (27.4 M), with similar mAP. This indicates that the improved model maintains excellent accuracy while drastically reducing the computational burden and model size, thereby improving the model's efficiency. Furthermore, with a comparatively minor increase in model parameters, the improved YOLOv5s-Seg model exhibits a notable improvement in mAP when compared to the earlier YOLOv7-Seg and YOLOv9-Seg models, indicating the efficacy of the improvement technique. This benefit is attributed to the proposed enhancements, namely the improved Neck module and the MLCA attention mechanism, which enhance the model's performance while controlling the increase in model complexity, increasing the model's viability and usefulness in real-world scenarios.
When compared to previous models, the improved YOLOv5s-Seg model attains an optimal compromise between model accuracy and inference speed. As demonstrated in
Table 6, the improved model outperforms other models of comparable accuracy in terms of speed. It is worth noting that the improved YOLOv5s-Seg maintains a good FPS (Frames Per Second) value while retaining accuracy, reaching 142 FPS across all experiments, approximately three times faster than models with similar accuracy, such as YOLOv9-Seg and YOLOv7-Seg. This gives the model a significant advantage in real-time tasks, such as real-time detection in UAV videos during post-earthquake rescue missions.
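FPS figures like those above are typically derived from the mean per-image inference latency (FPS = 1 / mean latency). A minimal timing sketch; `run_model` is a placeholder (an assumption, not the paper's code) standing in for the segmentation model's forward pass:

```python
import time

def run_model():
    """Placeholder for one forward pass; ~7 ms simulates
    the latency regime that yields roughly 140 FPS."""
    time.sleep(0.007)

def measure_fps(n_iters=20):
    """Average the per-call latency over n_iters warm runs,
    then report throughput as frames per second."""
    start = time.perf_counter()
    for _ in range(n_iters):
        run_model()
    mean_latency = (time.perf_counter() - start) / n_iters
    return 1.0 / mean_latency

print(f"{measure_fps():.0f} FPS")
```

In practice the first few iterations are usually discarded as warm-up, and batch size 1 is used so the number reflects the single-frame latency that matters for live UAV video.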
Although the improved model in this paper is slightly slower than the original YOLOv5s-Seg model, its accuracy is significantly higher. It is generally accepted that, in scenarios such as post-earthquake rescue and evaluation, a good model should strike a balance between fast detection speed and high accuracy. The improved YOLOv5s-Seg model presented in this work aims to improve on both fronts.
4. Discussion
To methodically assess each enhancement strategy's contribution to the YOLOv5s-Seg model's overall performance, an ablation study was carried out in this section. The YOLOv5s-Seg models with the different improvement strategies, MLCA and ASF, are summarized in
Table 7, which lists the improvement strategies used in each experiment. Experiment 1 (the original YOLOv5s-Seg), where no improvement strategies were used, serves as the baseline against which the further trials are assessed. The four separate experiments help clarify the effects of different combinations of these strategies on the model's performance.
The results of the ablation experiments show an increasing trend in segmentation and detection accuracy as the improvement strategies are gradually added. Meanwhile, the model's parameter count increases only slightly compared to the initial model. This pattern shows how the MLCA attention mechanism and ASF can improve model performance. The accuracy gains for each category in the ablation experiments are listed in
Table 8.
From
Table 9, it can be observed that, compared to the original YOLOv5s-Seg, the box mAP is higher by 1.07% and the mask mAP is higher by 1.11%. However, the increase in the number of model parameters is minimal, which means that memory usage during model runtime remains almost unchanged. Although the model's computational complexity (by about 9.7%) and inference time (by about 1.7 ms) have slightly increased compared to the original model, considering the accuracy improvement and the fact that the model can still meet the needs of rapid evaluation tasks, the improvement method in this paper is satisfactory. At the same time, for tasks that focus on building localization and damage classification, the variant with only the MLCA attention mechanism (Experiment 2) can be chosen. This model exhibits the highest box mAP, and its parameters, computational complexity, and inference time are essentially the same as the original model's, while its accuracy is significantly improved.
Due to the specificity of post-earthquake rescue operations, people are most concerned about the model's performance in each category of building damage severity. For example, in real-time rescue operations, more manpower and resources can be allocated to areas with higher degrees of damage, while resources can be conserved in undamaged areas, greatly improving rescue efficiency. The model performs best in the Intact and Collapse categories, with box AP and mask AP in the Intact category both increasing by 2.73%, and increasing by 2.58% and 2.14%, respectively, in the Collapse category. These modifications improve the model's performance for applications requiring high precision and computational efficiency, while also enhancing accuracy in the categories most relevant to rescue missions, making it an ideal solution.
Figure 12 illustrates the comparison between the improved YOLOv5s-Seg model and the original model’s prediction results.
Although this study has made progress and the model has demonstrated solid identification capabilities in the experiments, certain obstacles and limitations remain that require further investigation. For example, owing to the viewing perspective of unmanned aerial vehicle (UAV) images, the roofs of buildings often appear in the same plane as the roads, making the extraction of dense buildings challenging. Furthermore, because architectural structures vary across regions, building features differ from one region to another, so the model's generalization ability needs improvement. Due to a lack of UAV images from other earthquakes, the model's generalizability to different types of earthquakes and varied building structures has not been thoroughly tested, which is a limitation of this research.
Given the aforementioned limitations, future research will concentrate on the following directions: adding UAV images from different perspectives to the dataset to capture more detailed information about buildings, such as wall damage, in order to improve the accuracy of building damage classification; and enlarging the training dataset to include more post-disaster UAV images covering different types of earthquakes and regional building structures, which will improve the model's identification abilities and broaden its applicability. To further improve overall rescue efficiency, future work will also investigate adaptive learning processes, making the model more resilient and adaptable to highly dynamic, complex, and challenging environments and building structures.
5. Conclusions
This paper addresses the classification problem of identifying damaged building categories in post-earthquake UAV images and proposes an improved YOLOv5s-Seg instance segmentation algorithm. This research includes activities such as UAV image data collection, data annotation, comparative experiments, and ablation studies.
Several enhancements are introduced to the YOLOv5s-Seg model, including the MLCA attention mechanism and ASF small object detection enhancement in the Neck part. Ablation experiments demonstrate that these enhancement strategies gradually improve the mAP value compared to the original model, while also improving performance on the Intact and Collapse categories, providing valuable decision support for resource allocation during rescue operations.
This research uses comparative experiments against other widely used instance segmentation algorithms to demonstrate the superiority of the improved YOLOv5s-Seg model in post-earthquake building damage assessment. The improved model outperforms the current methods and can meet the demands of real-time rescue missions with fewer model parameters, higher accuracy, and faster inference speed. Overall, this study thoroughly examines how to apply and improve the YOLOv5s-Seg instance segmentation algorithm for evaluating building damage from post-earthquake UAV images, and how this improved algorithm can offer algorithmic support for post-earthquake rescue operations.