1. Introduction
High-speed rail is an important part of modern urban transportation [1,2,3,4], and it is essential to ensure its operational safety [5,6,7,8,9]. External environment security is one of the key points of high-speed rail operational security [10,11,12], and unstable objects in the external environment can easily enter the rail area and result in serious security accidents. Common categories of safety hazard sources around railways include color-coated steel sheet (CCSS) roof buildings, plastic greenhouses, and dust-proof nets [13].
CCSSs are rolled from color-painted galvanized (or aluminized) steel sheets and are widely used in construction. CCSS roof buildings, as a typical type of temporary construction, have been built in large numbers during urban expansion due to their corrosion resistance [14,15], easy construction, and low cost. They have also increased in number and size around high-speed rail lines that run through cities. However, their ease of construction also makes CCSS roofs unstable: the roofs are light, and the components of CCSS roof buildings are mainly bolted or welded [16], so high winds can easily blow a CCSS roof onto the high-speed rail line, causing operational safety issues. Therefore, regularly inspecting CCSS roof buildings in the external environment is important for the operational security of high-speed rails.
In practical work, the investigation of CCSS roof buildings surrounding high-speed rails relies on manual field investigations conducted within 500 m on both sides of the high-speed railway every seven days. This approach is labor-intensive and subject to terrain and weather constraints. The development of remote sensing imaging technology provides a new approach: earth observation systems can quickly acquire large-scale, high-resolution remote sensing images without being constrained by terrain. Identifying CCSS roof buildings in remote sensing images greatly reduces labor costs. However, existing CCSS roof building identification still relies on professional visual interpretation. Although this method reduces the workload of fieldwork, it requires professional operators and, due to slow processing speeds, cannot make effective use of massive remote sensing data.
Over the past years, many studies have been devoted to extracting CCSS roof buildings based on spectral and texture features from high-resolution remote sensing images. Guo et al. [17] proposed a new spectral index for blue CCSS roof buildings based on Landsat-8 images to map industrial areas; Samat et al. [18] designed several spectral indexes to enhance and identify blue and red CCSS roof buildings in Sentinel-2 images and analyzed the correlation between their construction area and urban population; Zhang et al. [19] established a decision tree model combining spectral and texture features to study the spatiotemporal change rule of CCSS roof buildings in Kenya, Africa.
Although these studies performed well on images with obvious selected features, the handcrafted features can be affected by sunlight, seasons, and sensors, making it difficult to use these methods on massive multi-source remote sensing imagery. Thus, it is natural to apply deep-learning technology as a more generalized and intelligent method. CCSS roof building extraction, as a pixel-wise classification task, can be regarded as semantic segmentation in computer vision. Although there has been a lot of work on object extraction from remote sensing images based on deep learning methods, the extraction of CCSS roof buildings is more challenging. First, the scale and density of CCSS roof buildings are related to the scene. For example, in industrial zones and construction land, CCSS roof buildings are large in scale and have high aggregation; the floor area could be tens of thousands of square meters, while in towns and urban villages, they are scattered and occupy only tens of square meters or even less. The great scale variation leads to holes and omissions in extraction masks. Second, the CCSS roof buildings in remote sensing images mostly appear as irregular shapes, which poses challenges to models that have fixed-shape receptive fields. Third, the locations of CCSS roofs are highly diverse due to the convenience of their construction, including industrial zones, urban villages, construction sites, etc. Moreover, in some places, people have illegally built sheds on top of buildings or reinforced the roofs with CCSS, making it more challenging to distinguish CCSS-roofed buildings from complex backgrounds.
In this paper, we are devoted to developing a method that can address the above issues and assist in ensuring high-speed rail external environment security. Our main contributions can be summarized as follows:
We propose a deformation-aware feature enhancement and alignment network (DFEANet) to realize intelligent CCSS roof building identification in the external environment of high-speed rails.
A deformation-aware feature enhancement module (DFEM) is proposed to solve the problem associated with the multiple scales and irregular shapes of CCSS roof buildings. It adjusts the spatial sampling locations of convolutional layers according to the input feature and uncovers implicit spectral features, thus separating these features from the complex background.
A feature alignment and gated fusion module (FAGM) is proposed to suppress interference from the background and maintain structural integrity and details. It mitigates the spatial misalignment between adjacent semantic feature maps and guides the fusion process, thereby reducing the introduction of redundant information.
High-resolution remote sensing images collected from the SuperView-1 satellite are used to evaluate the effectiveness of DFEANet. Compared with six classical and state-of-the-art (SOTA) deep-learning methods, DFEANet achieved competitive performance.
4. Methodology
In this section, we propose a method to improve the performance of CCSS roof building extraction from high-resolution remote sensing images by accurately separating the target from the complex background and reserving structural details. Concretely, we first give an overview of DFEANet. Then the deformable convolution adopted in DFEANet is introduced. Finally, two proposed modules are described in detail.
4.1. Model Overview
Effective feature representation of multi-scale CCSS roof buildings from high-resolution remote sensing images is essential to improving the extraction accuracy of CCSS roof buildings in the external environment of a high-speed rail. The encoder–decoder structure has been verified as effective in coping with scale variance: it extracts multi-level features via the encoder and then fuses them via the decoder to make predictions. In the feature extraction process, as the depth of the encoder increases, the resolution of the feature maps decreases while the semantic information increases, so features of different scales are contained in different-level feature maps. Specifically, features of small objects and details are contained in low-level feature maps, while features of large-scale objects are captured in high-level feature maps. To accurately extract CCSS roof buildings of different scales from complex backgrounds and preserve more detailed information, we propose DFEANet, an overview of which is shown in Figure 3. DFEANet adopts ResNet as the encoder. Deformable convolution is adopted in DFEMs to fit the deformation of CCSS roof buildings in different-level feature maps and separate their features from the complex background. The processed feature maps are then adjusted and fused from top to bottom by FAGMs to reduce spatial deviation. A gating mechanism is adopted to guide the fusion process and suppress redundant information. Finally, the multi-level feature maps obtained by the FAGMs are sampled to a unified scale, concatenated along the channel dimension, and input into the segment head to generate the final prediction.
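The overall data flow can be sketched as follows. This is a minimal NumPy stand-in, assuming four encoder stages and reducing the encoder stage, DFEM, FAGM, and the up-sampling to placeholder operations; only the encoder/enhance/top-down-fuse/concatenate pattern of the architecture is shown, not the real modules:

```python
import numpy as np

def upsample_to(x, size):
    """Nearest-neighbour up-sampling of a (C, H, W) map to size x size
    (a stand-in for the bilinear interpolation used in practice)."""
    f = size // x.shape[1]
    return x.repeat(f, axis=1).repeat(f, axis=2)

def encoder_stage(x):
    """Placeholder ResNet stage: halves the spatial resolution."""
    return x[:, ::2, ::2]

def dfem(x):
    """Placeholder for the deformation-aware feature enhancement module."""
    return x

def fagm(high, low):
    """Placeholder FAGM: up-sample the high-level map, then fuse."""
    return low + upsample_to(high, low.shape[1])

def dfeanet_forward(image):
    # Encoder: four stages give feature maps at 1/2 ... 1/16 resolution,
    # each enhanced by a DFEM before entering the decoder.
    feats, x = [], image
    for _ in range(4):
        x = encoder_stage(x)
        feats.append(dfem(x))
    # Decoder: FAGMs fuse adjacent levels top-down, passing semantics down.
    fused, outs = feats[-1], [feats[-1]]
    for low in reversed(feats[:-1]):
        fused = fagm(fused, low)
        outs.append(fused)
    # Segment head input: unify scales, then concatenate along channels.
    target = outs[-1].shape[1]
    return np.concatenate([upsample_to(o, target) for o in outs], axis=0)
```

For a (3, 32, 32) input, the encoder yields maps of spatial size 16, 8, 4, and 2; after top-down fusion and unification, the segment head receives a (12, 16, 16) stack.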
4.2. Deformable Convolution
In this study, deformable convolution is used to establish the relationship between pixels in irregular local areas according to the feature shapes, improving the integrity of feature extraction. Traditional convolution samples a rectangular region around each pixel in the input image; thus, the receptive field is mostly rectangular. However, in reality, CCSS roofs often have different shapes, and a regular sampling grid limits the exploration of interpixel relationships at different distances. Therefore, it is desirable to adaptively adjust receptive field sizes to establish interpixel relationships more efficiently and improve feature representation. To this end, Zhu et al. [40,41] proposed deformable convolution. It adds learnable two-dimensional offsets to the regular grid-like sampling locations of the standard convolution to flexibly adjust each sample location, and modulates each sample with a learnable feature amplitude, which controls both the spatial distribution and the relevance of samples.
In standard convolution, taking the 3 × 3 convolution of dilation 1 as an example, the set of sample locations relative to the central one can be expressed as

R = {(−1, −1), (−1, 0), …, (0, 1), (1, 1)}.

For each location p on the output y, the value y(p) is obtained by sampling the input x as expressed in [40]:

y(p) = Σ_{k=1}^{K} w_k · x(p + p_k),   (1)

where K is the number of sample locations, w_k denotes the weight for each sample location, and p_k enumerates the locations in R.
Deformable convolution [41] introduces learnable offsets and modulation to the regular grid sample locations, and the output y(p) obtained by the deformable convolution can be expressed as

y(p) = Σ_{k=1}^{K} w_k · x(p + p_k + Δp_k) · Δm_k,   (2)

where Δp_k and Δm_k are learnable offsets and modulation scalars, respectively. Δp_k adjusts the sample location according to the input features, while Δm_k, which lies in the range [0, 1], modulates the feature amplitude to suppress irrelevant information.
Figure 4 illustrates the workflow of deformable convolution. For each location, the offsets Δp_k and the modulation scalars Δm_k are learned from the input feature map by a separate convolution layer. This layer outputs a tensor with the same length and width as the input, and it has 3K channels. The offsets Δp_k in both the x and y directions are recorded in the first 2K channels, and the remaining K channels correspond to the modulation scalars Δm_k. Then, deformable convolution calculates the output following the formulation in Equation (2). In DFEANet, deformable convolutional layers are adopted in DFEM to enhance the spatial representation of multi-level features and alleviate incomplete predictions.
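Equation (2) can be illustrated with a small NumPy sketch. The function below evaluates the modulated deformable sum at a single output location over a 3 × 3 grid; it is a didactic stand-in, as production code would use an optimized operator such as `torchvision.ops.deform_conv2d`:

```python
import numpy as np

def bilinear(x, py, px):
    """Bilinearly sample the 2-D map x at fractional location (py, px);
    out-of-range neighbours contribute zero."""
    H, W = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    val = 0.0
    for dy in (0, 1):
        for dx in (0, 1):
            yy, xx = y0 + dy, x0 + dx
            if 0 <= yy < H and 0 <= xx < W:
                val += (1 - abs(py - yy)) * (1 - abs(px - xx)) * x[yy, xx]
    return val

def deform_conv_at(x, w, p, offsets, mods):
    """y(p) = sum_k w_k * x(p + p_k + dp_k) * dm_k over the 3x3 grid R.
    offsets is a (9, 2) array of (dy, dx) offsets dp_k, and mods is a
    (9,) array of modulation scalars dm_k in [0, 1]."""
    grid = [(gy, gx) for gy in (-1, 0, 1) for gx in (-1, 0, 1)]
    out = 0.0
    for k, (gy, gx) in enumerate(grid):
        dy, dx = offsets[k]
        out += w[k] * mods[k] * bilinear(x, p[0] + gy + dy, p[1] + gx + dx)
    return out
```

With all offsets zero and all modulation scalars equal to one, the result reduces to the standard 3 × 3 convolution of Equation (1); the fractional offsets are what require the bilinear sampling step.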
4.3. Deformation-Aware Feature Enhancement Module
The extraction of large-scale CCSS roof buildings often results in incompleteness, while small ones may be omitted from the extraction masks. These issues are usually related to the inefficient separation of different-scale CCSS roof buildings from the background in multi-level feature maps. The complex background in remote sensing images disrupts CCSS roof building extraction. In the encoder–decoder structure, if the multi-level feature maps extracted by the encoder are delivered directly to the decoder, the features of CCSS roof buildings and noise may be transferred indistinguishably, causing the features to become confused with background noise. This is particularly challenging for CCSS roof buildings, whose irregular shapes and wide scale range make them even harder to distinguish from the background. Generally, strong dependencies exist among pixels of the same object, so capturing more interpixel dependencies leads to more accurate feature extraction. Therefore, DFEM is proposed to uncover the implicit spatial and spectral relationships between the target features and the background. Specifically, deformable convolution is adopted in the spatial enhancement and feature extraction parts of DFEM to capture the spatial morphology of CCSS roof building features.
As shown in Figure 5, DFEM uncovers the relationship between the target and background in the multi-level feature maps across two dimensions: channel and space. Spectral signatures, as the most prominent features of CCSS roof buildings, are primarily contained in the channel dimension of the image. Within this context, the channel branch works to enhance the implicit spectral signatures by weighting the channels. Global average pooling (GAP) [42] is first adopted to compress the input feature map F along the spatial dimension to a vector v. The input feature map can be considered as a set of single-channel feature maps expressed as F = [F_1, F_2, …, F_C]. According to the definition of GAP in [42], the m-th channel of vector v can be obtained by calculating the mean of all pixels in F_m:

v_m = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} F_m(i, j).
Then, two convolutional layers and a ReLU function are applied to obtain the channel weight vector, which is subsequently limited to the range [0, 1] using a sigmoid function. The feature map with enhanced spectral signatures is obtained by multiplying the original input by the channel weight vector.
The spatial branch enhances texture and geometric features by capturing interpixel relationships. To establish these interpixel dependencies, the input feature map is compressed along the channel dimension using a convolutional layer, resulting in a single-channel tensor with spatial information embedded. A deformable convolution is then applied to obtain the spatial weight matrix, which helps to differentiate targets from the background. Notably, since deformable convolution can adjust the sampling position of the convolution kernel based on the input features, it can capture relationships between pixels at varying distances. A sigmoid function is used to constrain the values in the spatial weight matrix to the range [0, 1], and the spatially enhanced feature map is obtained by multiplying the original input feature map by the spatial weight matrix.
To integrate crucial information from both spatial and channel dimensions, the outputs of the two branches are concatenated along the channel dimensions and then passed through a 1 × 1 convolutional layer to reduce dimensionality. Subsequently, a deformable convolution layer is employed to aggregate the pixels belonging to CCSS roof buildings and further distinguish the features from the background. This results in the feature map E, embedded with the relationship between the target and the background, in which the receptive field can adapt according to the input feature.
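The two branches can be sketched in NumPy as follows. This is a simplified illustration: the learnable convolutions are reduced to fixed linear maps, and the deformable convolutions are omitted, so only the squeeze–weight–rescale pattern of the module is shown:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_branch(F, w):
    """Channel branch: GAP squeezes each channel to a scalar, a linear
    map w (standing in for the two conv layers + ReLU) produces channel
    weights in (0, 1), and the input is reweighted per channel."""
    v = F.mean(axis=(1, 2))                 # GAP over the spatial dims
    weights = sigmoid(w @ v)                # channel weight vector
    return F * weights[:, None, None]       # spectral-signature enhancement

def spatial_branch(F, w1x1):
    """Spatial branch: a 1x1 conv (here a weighted channel sum) embeds
    spatial information into a single-channel map; a sigmoid turns it
    into the spatial weight matrix that rescales every location."""
    s = np.tensordot(w1x1, F, axes=([0], [0]))  # (H, W) spatial map
    return F * sigmoid(s)[None, :, :]           # texture/geometry boost
```

Both branch outputs keep the input shape and are bounded by the input magnitude, since the sigmoid weights lie in (0, 1); in DFEM they are then concatenated, reduced by a 1 × 1 convolution, and passed through a deformable convolution.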
4.4. Feature Alignment and Gated Fusion Module
Multi-level feature fusion is the primary approach in tackling large differences in CCSS roof building scales. Many feature fusion methods overlook the semantic gap caused by down-sampling operations. They either directly add the low-level feature maps to the up-sampled high-level feature maps or concatenate them along the channel dimension. This can lead to the misclassification of boundaries and small objects. Specifically, the loss of spatial details and locations during down-sampling can result in misalignment between low-level and high-level feature maps. Moreover, important information regarding different-scale CCSS roof buildings is often embedded in the differences among multi-level feature maps. Indiscriminate fusion of adjacent feature maps may inundate them with excessively redundant information. To handle these problems, we propose FAGM. Similar to [43,44], we consider that the relative offset between feature maps resembles the optical flow between adjacent video frames, and we align the feature maps accordingly. Furthermore, a gating mechanism is employed to select important information and guide the fusion process.
Figure 6 shows the structure of FAGM. In this structure, the high-level feature map E_{i+1} is first up-sampled to match the size of the low-level feature map E_i and then concatenated with E_i along the channel dimension. Subsequently, the concatenated feature map passes concurrently through two branches, generating two offset maps, Δ_{i+1} and Δ_i. Each offset map has two channels corresponding to offsets in the x and y directions. E_{i+1} and E_i are aligned by a warping function based on Δ_{i+1} and Δ_i, which resamples the feature maps using bilinear interpolation. The aligned features are then concatenated along the channel dimension and evaluated by the gate map G, whose values are restricted to within the range [0, 1] via a sigmoid function. The high-level and low-level feature maps are fused by this gating mechanism, which controls the information flow in the fusion process, thereby improving fusion efficiency. According to the structure illustrated in Figure 6, the FAGM could be performed as

F_i = G ⊙ W(δ(E_{i+1}), Δ_{i+1}) + (1 − G) ⊙ W(E_i, Δ_i),   (3)

where δ denotes bilinear interpolation and W denotes the warping function.
Three FAGMs integrate features of adjacent feature maps from top to bottom, passing the high-level semantics down.
Figure 7 visualizes the feature maps and offset maps in the third FAGM, where the aligned feature maps preserve a clearer structure, leading to more consistent representations of CCSS roof buildings. Moreover, in the output feature map, the gating mechanism effectively suppresses background noise while preserving the boundaries.
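A NumPy sketch of the warping function and the gating step is given below; the (2, H, W) offset-field layout and the convex-combination form of the fusion are assumptions made for illustration:

```python
import numpy as np

def warp(feat, offset):
    """Resample a (C, H, W) feature map at locations displaced by a
    (2, H, W) offset field (dy, dx), using bilinear interpolation;
    out-of-range neighbours contribute zero."""
    C, H, W = feat.shape
    out = np.zeros_like(feat)
    for i in range(H):
        for j in range(W):
            py, px = i + offset[0, i, j], j + offset[1, i, j]
            y0, x0 = int(np.floor(py)), int(np.floor(px))
            for dy in (0, 1):
                for dx in (0, 1):
                    yy, xx = y0 + dy, x0 + dx
                    if 0 <= yy < H and 0 <= xx < W:
                        out[:, i, j] += ((1 - abs(py - yy)) *
                                         (1 - abs(px - xx)) * feat[:, yy, xx])
    return out

def gated_fuse(high_aligned, low_aligned, gate):
    """Gated fusion: the (H, W) gate map, with values in [0, 1],
    controls how much of each aligned feature map flows to the output."""
    return gate * high_aligned + (1.0 - gate) * low_aligned
```

A zero offset field leaves the feature map unchanged, while a learned offset field shifts each location independently, which is how the spatial misalignment between adjacent levels is compensated.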
4.5. Segmentation Head
To generate the final prediction, as shown in Figure 8, the multi-level feature maps are first up-sampled to a uniform size and then concatenated along the channel dimension, denoted as U. To improve the precision of boundaries, the feature map F_0, generated by stage0 as depicted in Figure 3, serves to guide the generation of the output mask via an alignment module. Specifically, similar to the FAGM, the up-sampled U and F_0 are first concatenated along the channel dimension. Subsequently, an offset map Δ_0 is produced, which the warp function utilizes to align the boundaries of U with F_0. Ultimately, a prediction mask with the same size as the input image is generated through a series of layers.
4.6. Loss Function
Considering the imbalance between foreground and background in remote sensing images, we selected focal loss [45] as the loss function. The focal loss is a variation of the cross-entropy loss that focuses training on hard examples:

FL(p_t) = −α_t (1 − p_t)^γ log(p_t),   (4)

where p_t denotes the predicted probability of the ground-truth class. Here we use the same values for α and γ as in [45].
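For the binary case, the focal loss can be written directly in NumPy; `alpha = 0.25` and `gamma = 2` below are the defaults recommended in [45]:

```python
import numpy as np

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss: the (1 - p_t)^gamma factor down-weights easy
    examples so training focuses on hard ones. p is the predicted
    foreground probability and target is 0 or 1."""
    p_t = np.where(target == 1, p, 1.0 - p)           # prob. of true class
    alpha_t = np.where(target == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

A well-classified foreground pixel (p close to 1) contributes almost nothing, while a misclassified one keeps a large loss, which counters the foreground–background imbalance typical of remote sensing scenes.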
6. Conclusions
The utilization of deep learning for the extraction of CCSS roof buildings from remote sensing images can effectively aid in the intelligent inspection of such buildings within the high-speed rail external environment. Previous research utilized models designed based on natural imagery, thereby overlooking the unique characteristics of CCSS roof buildings within remote sensing images. In this study, DFEANet was proposed to improve the accuracy of CCSS roof building extraction. To improve the integrity of irregular roof extraction, DFEM was proposed to perform feature enhancement and adaptive receptive field adjustment. Moreover, FAGM was proposed to preserve boundary details while suppressing background noise.
Quantitative and qualitative analysis based on the remote sensing images of the Beijing–Zhangjiakou high-speed railway demonstrates the superior performance of DFEANet in the extraction of CCSS roof buildings. It can accurately identify CCSS roof structures while ensuring edge accuracy and effectively striking a balance between model complexity and accuracy. The ablation experiments and visualizations further verified the effectiveness of the proposed two modules. In practical applications, by deploying the proposed method on a big data platform and integrating the extracted results with geographic information, it is possible to carry out statistical analysis of potential risk targets. This assists railway personnel in promptly identifying and rectifying safety hazards, thus ensuring the safety of the external environment of the high-speed rail.