3.1. Multi-Scale Feature Prediction Fusion
In deep learning, multi-scale networks process the input data at several scales within the same neural network. This approach enlarges the receptive field of the network so that it can handle objects of different sizes in the input image. The receptive field (RF) [35] is the area of the input image whose pixels influence the output of the neural network. In a convolutional neural network, the receptive field of each neuron can be understood as the input area of that neuron; knowing its size helps one understand how the network represents the input image. The size of the receptive field has a significant impact on the performance of a neural network: smaller receptive fields typically capture smaller local features (e.g., edges and corner points), while larger receptive fields allow the recognition of larger features (e.g., objects and scenes). In addition, a larger receptive field helps the network understand contextual information better, thus improving its performance. Multi-scale networks usually involve two aspects: multi-scale input and multi-scale feature extraction. By using multi-scale networks, more comprehensive information can therefore be extracted, including both global overall information and local detailed information.
Multi-scale feature prediction fusion (MFPF) [36] is a commonly used multi-scale neural network structure that is widely applied in computer vision tasks such as image classification, object detection, and semantic segmentation. Its core idea is to make predictions from the features in the model at different scales and then fuse the predicted results.
The main purpose of this section is to combine the multi-scale feature prediction fusion idea with the ordinary U-Net network, improving the original U-Net into a multi-scale feature fusion U-Net network and applying it to pavement crack segmentation. The improved U-Net retains the original convolution and pooling parts; the difference is that fusion prediction is performed on the features of the intermediate layers of the U-Net network. In addition, because the feature maps output at each scale have different sizes after the target image is down-sampled by the encoder network, the output of each intermediate layer is restored to the size of the original target image by a bilinear interpolation operation before feature fusion prediction. Finally, the feature maps of different scales are channel-stitched to obtain a tensor of larger channel dimension, which is then processed by a convolution operation to obtain the final multi-scale feature prediction fusion output. This preserves the information at different scales and fuses it efficiently, improving the performance and generalization of the model. This approach enlarges the receptive field of the network, obtains contextual information at different scales, and enhances the network's ability to understand and judge the target features. The output of this multi-scale feature prediction fusion can be mathematically expressed as:
$$Y = \mathrm{Conv}\big(\mathrm{Concat}(F_1, F_2, \ldots, F_n)\big)$$
where $Y$ denotes the output after feature prediction fusion, and $F_1, F_2, \ldots, F_n$ denote the up-sampled output feature maps at different scales.
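As an illustration, the following is a minimal PyTorch sketch of this fusion step. The module name, channel counts, and number of scales are assumptions for illustration, not the exact configuration used in this paper: each intermediate feature map is restored to the target resolution by bilinear interpolation, channel-stitched, and convolved into the fused prediction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusionHead(nn.Module):
    """Fuse decoder feature maps from several scales into one prediction.

    Sketch of Y = Conv(Concat(F_1, ..., F_n)) after bilinear up-sampling,
    as described above. Channel counts are illustrative assumptions.
    """
    def __init__(self, in_channels=(64, 128, 256, 512), num_classes=1):
        super().__init__()
        # 1x1 convolution over the channel-concatenated tensor.
        self.fuse = nn.Conv2d(sum(in_channels), num_classes, kernel_size=1)

    def forward(self, features, out_size):
        # Restore every scale to the original target-image size.
        upsampled = [
            F.interpolate(f, size=out_size, mode="bilinear", align_corners=False)
            for f in features
        ]
        # Channel-stitch the feature maps, then convolve to the final output.
        return self.fuse(torch.cat(upsampled, dim=1))

# Usage: four hypothetical decoder outputs at 1/1, 1/2, 1/4, 1/8 resolution.
feats = [torch.randn(1, c, 256 // s, 256 // s)
         for c, s in zip((64, 128, 256, 512), (1, 2, 4, 8))]
head = MultiScaleFusionHead()
print(head(feats, out_size=(256, 256)).shape)  # torch.Size([1, 1, 256, 256])
```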
The structure of the U-Net network combined with multi-scale feature prediction fusion is shown in Figure 3. The pre-processed pavement crack images enter at the input of the network model and, through the multi-layer convolution operations and down-sampling of the encoder network, yield feature maps at different scales with different resolutions and semantic information. Usually, the shallower feature maps contain more local information, while the deeper feature maps contain more global information. These feature maps are partly passed to higher levels and partly sent directly, through skip connections, to the corresponding decoder network as input; after the up-sampling operation, they are restored to their original size and each outputs a prediction branch. Finally, the feature maps of the prediction branches at different scales are fused into the final output, achieving the combination of global information at different scales.
3.3. New Parallel Attention Module
Although the U-Net network used in this paper is based on the encoder–decoder structure, which has great advantages in image semantic segmentation, this structure focuses too much on local information when extracting features from the target image and therefore cannot take the global information of the image into account well, leading to insufficient accuracy in the final segmentation results.
This paper proposed an attention mechanism module connected to the output of the multi-scale feature prediction fusion U-Net network; the role of this module is to obtain richer global contextual information about the target image. The proposed module was improved based on the parallel attention module in the dual-attention network (DANet) [40], whose structure is shown in Figure 6. It is composed of a spatial attention module and a channel attention module connected in parallel: the feature maps passing through it are processed by the spatial attention and channel attention mechanisms separately, and the two resulting sets of feature descriptions are then summed.
The parallel attention module consists of a channel attention module (CA) and a position attention module (PA) in parallel, where the PA module emphasizes the positional dependency between any two features at different positions in the image and the CA module focuses on the dependency between different channels of the image. By summing the outputs of the two attention modules, similar features of subtle objects are selectively aggregated to highlight their feature representations while reducing the influence of salient objects on image segmentation, and similar features at any scale are adaptively integrated from a global perspective to improve the segmentation accuracy of the images.
In order to enable the network model to associate more global features and fully relate contextual relationships, so that more global context can be encoded into local features, this paper modified the parallel attention module by keeping the original PA module and replacing the CA module with a gated channel transformation module (GCT block). A new parallel attention module was thus constructed to process the output of the multi-scale feature prediction fusion U-Net and further exploit the global contextual information of images. The GCT module is also a channel attention module; it enables more efficient global context modeling of the fused feature maps and is more lightweight, with fewer parameters, than the common CA module. The structure of the proposed new parallel attention module is shown in Figure 7.
The CA module adjusts the channel features of the feature map by calculating an importance weight for each channel, improving the perception of key features. Specifically, each channel of the feature map is first reduced by a global pooling layer, then passed through two fully connected layers to obtain the channel weight coefficients, and finally the weight coefficients are applied to the original feature map to obtain the channel attention-adjusted feature map. The structure diagram of the gated channel transformation attention module is shown in Figure 8.
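The CA computation just described can be sketched in PyTorch as follows. This is a minimal sketch in the style of squeeze-and-excitation; the module name and reduction ratio are illustrative assumptions, not values from this paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention as described above: global pooling, two fully
    connected layers, then re-weighting of the original feature map."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze each channel to 1x1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # channel weight coefficients in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # apply weights to the original map
```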
Here, $x \in \mathbb{R}^{C \times H \times W}$ is the activation feature in the convolutional network, where $H$ and $W$ are the spatial height and width, and $C$ is the number of channels.
The module introduces three trainable weights, $\alpha$, $\gamma$, and $\beta$. The embedding weight $\alpha$ is responsible for the adaptive embedding output. The gating weight $\gamma$ and the gating bias $\beta$ are responsible for controlling the activation of the gate.
Here, $x = [x_1, x_2, \ldots, x_C]$, where $x_c \in \mathbb{R}^{H \times W}$ corresponds to each channel of $x$.
The channel attention module in the new parallel attention mechanism was built based on the GCT module. Among its three main components, global context embedding, channel normalization, and gated adaptation, the GCT module uses a normalization method to establish competitive or cooperative relationships between the channels. The normalization operation itself is parameter-free. To make the GCT module learnable, a global context embedding operator was added, which embeds the global context and controls the weight of each channel before normalization. A gated adaptation operator was also added, which adjusts the input features channel-wise according to the normalized output. The global context embedding operator can encode broader global contextual semantic information in the image into local features, thus improving feature representation.
The global context part of this module differs from the SE module [41] in that the gated channel transformation attention module does not use global average pooling (GAP), because GAP can fail in some cases. For example, in some applications instance normalization is used, which fixes the mean of each channel so that the resulting vector becomes constant. Therefore, this paper used the L2 norm for global context embedding (GCE). The module is defined as:
$$s_c = \alpha_c \left\| x_c \right\|_2 = \alpha_c \left[ \sum_{i=1}^{H} \sum_{j=1}^{W} \left( x_c^{i,j} \right)^2 + \epsilon \right]^{\frac{1}{2}}$$
where the parameter $\alpha$ is defined as $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_C]$; channel $c$ will not participate in channel normalization when $\alpha_c$ tends to zero, and $\epsilon$ is a very small constant that avoids the problem of computing the derivative at the zero point.
The normalization method establishes a competitive relationship between channels, so that channels with larger responses become relatively larger and suppress the other channels with smaller feedback. L2 normalization is used in this module to perform channel normalization:
$$\hat{s}_c = \frac{\sqrt{C}\, s_c}{\left\| s \right\|_2} = \frac{\sqrt{C}\, s_c}{\left[ \sum_{c=1}^{C} s_c^2 + \epsilon \right]^{\frac{1}{2}}}$$
The gated adaptation part was designed to control the activation of channel features by introducing a gating mechanism, through which the GCT module both competes and cooperates during training. The gating weight $\gamma$ and the gating bias $\beta$ are designed to control whether the channel features are activated or not. When the gating weight $\gamma_c$ of a channel is positively activated, the GCT module facilitates the “competition” between the features of this channel and the features of other channels. When a channel's $\gamma_c$ is negatively activated, the GCT facilitates “cooperation” between the features of this channel and the features of other channels. Gated adaptation is defined as follows:
$$\hat{x}_c = x_c \left[ 1 + \tanh\left( \gamma_c \hat{s}_c + \beta_c \right) \right]$$
where $\gamma = [\gamma_1, \gamma_2, \ldots, \gamma_C]$ and $\beta = [\beta_1, \beta_2, \ldots, \beta_C]$.
In addition, when the gating weights and gating bias are 0, the original features are passed unchanged to the next layer.
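Putting the three parts together, a minimal PyTorch sketch of the GCT block follows, transcribing the global context embedding, channel normalization, and gated adaptation equations above; the epsilon value and initialization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GCT(nn.Module):
    """Gated channel transformation: s_c = alpha_c * ||x_c||_2, L2 channel
    normalization, then gating x * (1 + tanh(gamma * s_hat + beta))."""
    def __init__(self, channels, eps=1e-5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1, 1))   # embedding weight
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))  # gating weight
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))   # gating bias
        self.eps = eps

    def forward(self, x):
        # Global context embedding: per-channel L2 norm instead of GAP.
        embedding = self.alpha * (x.pow(2).sum(dim=(2, 3), keepdim=True)
                                  + self.eps).sqrt()
        # Channel normalization: sqrt(C) * s_c / ||s||_2, via the mean over C.
        norm = (embedding.pow(2).mean(dim=1, keepdim=True) + self.eps).sqrt()
        # Gated adaptation; identity mapping when gamma and beta are zero,
        # since tanh(0) = 0 and the gate becomes 1.
        gate = 1.0 + torch.tanh(self.gamma * embedding / norm + self.beta)
        return x * gate
```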
The PA module, on the other hand, focuses on the correlation between different positions in the feature map to better capture the contextual information of the target object. The position features of the feature map are adjusted by calculating an importance weight for each position, enhancing the target information in the feature map. Specifically, the feature map passes through two branches that compute the horizontal and vertical position attention coefficients, the two coefficients are multiplied together to obtain the final position attention coefficients, and these coefficients are applied to the original feature map to obtain the spatially attention-adjusted feature map. The structure of the PA module is shown in Figure 9.
It can be seen from Figure 9 that the shape of the input feature map $A$ is $C \times H \times W$. The input feature map passes through three convolution layers to obtain three new feature maps, denoted feature map $B$, feature map $C$, and feature map $D$. A reconstruction operation (reshape) is then performed on them, after which their shape becomes $C \times N$, where $N = H \times W$; a transpose operation is then applied to feature map $B$ to obtain $B^{T} \in \mathbb{R}^{N \times C}$. A relationship matrix $Z$, formed only by the $N$ spatial points of $B$ and $C$ without considering the channel factor, can then be obtained, with shape $N \times N$. The relationship matrix $Z$ is converted to a probability distribution expression by a normalization operation (SoftMax). The $Z$ matrix simply describes the relationship between two pixel positions in space; it is multiplied with the feature map $D$ matrix and scaled by the factor $\alpha$ to obtain an attention-weighted feature map of shape $C \times N$. This map is then expanded and reshaped to $C \times H \times W$, i.e., restored to the original shape of the input feature map, so that it can be summed with the input feature map to obtain the output. Here, the relationship matrix $Z$ is generated by the formula:
$$z_{ji} = \frac{\exp\left( B_i \cdot C_j \right)}{\sum_{i=1}^{N} \exp\left( B_i \cdot C_j \right)}$$
where $z_{ji}$ measures the influence of the $i$-th position on the $j$-th position.
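The position attention computation above can be sketched in PyTorch as follows. This is a minimal sketch following the DANet-style position attention module; the intermediate channel reduction to $C/8$ follows DANet and is an assumption here, not a confirmed setting of this paper.

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """Position attention: an N x N relationship matrix Z over all spatial
    positions (N = H * W), softmax-normalized, used to re-weight features."""
    def __init__(self, channels):
        super().__init__()
        self.conv_b = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.conv_c = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.conv_d = nn.Conv2d(channels, channels, kernel_size=1)
        self.alpha = nn.Parameter(torch.zeros(1))    # learnable scale factor
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w
        # B^T: (b, n, c'), C: (b, c', n) -> relationship matrix Z: (b, n, n)
        bt = self.conv_b(x).view(b, -1, n).permute(0, 2, 1)
        cc = self.conv_c(x).view(b, -1, n)
        z = self.softmax(torch.bmm(bt, cc))
        # Multiply with D, reshape back to (b, c, h, w), scale, add the input.
        d = self.conv_d(x).view(b, c, n)
        out = torch.bmm(d, z.permute(0, 2, 1)).view(b, c, h, w)
        return self.alpha * out + x
```

In the new parallel attention module of Figure 7, a block of this form would run in parallel with the GCT block sketched earlier, with the two outputs summed.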
3.4. Improved Loss Function
To prevent the improved model from suffering from non-convergence of the loss during training, the loss function was improved in this section.
The cross-entropy loss function is often used as the loss function for segmentation networks, and it can be applied to both binary and multi-class classification tasks. The output in this paper contains only two classes, crack regions and background regions, so the task is binary classification. Let the probability of predicting a pixel as a crack be $q$, and the probability of predicting it as background be $1 - q$. The cross-entropy loss $L$ is then defined as shown in Equation (9):
$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ z_i \log q_i + \left( 1 - z_i \right) \log\left( 1 - q_i \right) \right] \tag{9}$$
In the formula, $N$ represents the total number of samples, $z_i$ represents the classification label of sample $i$, and $q_i$ refers to the probability that the sample is predicted to be a crack. The cross-entropy loss function gives the same weight to crack and background samples. However, the crack area in a pavement crack image is usually small, resulting in a serious imbalance between the numbers of positive and negative samples. To address this problem, this section used the weighted loss function Focal Loss as the loss function of the multi-scale U-Net network.
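To illustrate this imbalance numerically, the following sketch uses a hypothetical 256 × 256 mask in which a thin band of pixels is labeled as crack (the mask and the untrained uniform predictor are assumptions for illustration) and computes how little of the unweighted cross-entropy loss the crack pixels contribute:

```python
import torch
import torch.nn.functional as F

# Hypothetical mask: only ~6% of pixels are cracks, mimicking the
# positive/negative imbalance described above.
target = torch.zeros(1, 1, 256, 256)
target[..., 120:136, :] = 1.0                      # a thin horizontal "crack"

logits = torch.zeros_like(target)                  # an untrained, uniform predictor
bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")

pos_loss = bce[target == 1].sum()
neg_loss = bce[target == 0].sum()
print(f"positive share of total loss: {pos_loss / (pos_loss + neg_loss):.3f}")
# ~0.06: background pixels dominate the loss, which motivates the
# weighted Focal Loss of Equation (10).
```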
The weighted loss function is an improvement on the cross-entropy loss function; it increases the model's focus on hard-to-classify samples and on positive samples.
The weighted loss function adds a new factor $\alpha$. Adjusting the value of $\alpha$ controls the weight of positive and negative samples in the overall loss; a smaller value of $\alpha$ reduces the weight of negative samples. The improved cross-entropy loss $L_{FL}$ is shown in Equation (10):
$$L_{FL} = -\frac{1}{N} \sum_{i=1}^{N} \left[ \alpha\, z_i \left( 1 - q_i \right)^{\gamma} \log q_i + \left( 1 - \alpha \right) \left( 1 - z_i \right) q_i^{\gamma} \log\left( 1 - q_i \right) \right] \tag{10}$$
where $N_{pos}$ and $N_{neg}$ represent the numbers of positive and negative samples. The value of $\alpha$ is determined by the ratio of the number of positive to negative samples.
The weighted loss function incorporates a moderation factor, $\left( 1 - q \right)^{\gamma}$, which serves to adjust the weights of the easy-to-classify and hard-to-classify samples. When a sample is misclassified, it is considered a hard-to-classify sample, and the correct prediction probability $q$ converges to 0; the moderation factor $\left( 1 - q \right)^{\gamma}$ then converges to 1. Conversely, when $q$ converges to 1, the sample is considered easily classifiable and $\left( 1 - q \right)^{\gamma}$ converges to 0.
In this way, the effect of the loss value of the easily classified samples on the overall loss value is attenuated.
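Equation (10) and the moderation factor can be transcribed into a minimal PyTorch sketch as follows; the default alpha and gamma values are common Focal Loss choices and are assumptions here, not the values tuned in this paper.

```python
import torch

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary Focal Loss as in Equation (10): alpha balances positive and
    negative samples, (1 - q)^gamma down-weights easy samples."""
    q = torch.sigmoid(logits)                       # predicted crack probability
    pos = -alpha * (1 - q).pow(gamma) * torch.log(q.clamp_min(1e-8))
    neg = -(1 - alpha) * q.pow(gamma) * torch.log((1 - q).clamp_min(1e-8))
    loss = torch.where(targets == 1, pos, neg)
    # With gamma = 0 and alpha = 0.5 this reduces to 0.5 * cross-entropy.
    return loss.mean()
```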