First, the crack image is fed into the U-Net network to extract features of the crack image. The U-Net network uses an encoder–decoder structure to extract features at different scales and fuses encoder and decoder features of the same scale to make the crack features more prominent. Second, the extracted features are fed into the morphological-processing network, where the white top-hat and black bottom-hat transformations are performed, respectively, to correct the influence of uneven illumination on the captured crack images. Then, the output features of the U-Net network and the morphological-processing network are fused and fed into the side network to obtain a crack prediction map at each scale. Finally, the multi-scale prediction results are fused to obtain the final crack segmentation result, and the losses of each scale and of the final prediction map are combined into the final loss function.
3.1. Multi-Scale Feature Extraction Based on U-Net Network
This paper selects the U-Net network as the feature extraction module. As shown in
Figure 2, the U-Net network consists of the encoding part on the left, the decoding part on the right and two convolution and activation layers at the bottom. The encoding part consists of four repeating structures, each of which comprises two 3 × 3 convolutional layers with nonlinear ReLU activations and a 2 × 2 max-pooling layer, which correspond to the blue and red arrows in
Figure 2, respectively. The decoding part is similar to the encoding part and also consists of four repeating structures. A deconvolution (up-conv) is applied before each repeating structure to halve the number of channels and double the size of the feature map, corresponding to the green arrow in the figure. The deconvolution result is concatenated with the feature map of the same scale from the corresponding encoding part, which corresponds to the white/blue blocks of each repeating structure. The concatenated feature map is subjected to two 3 × 3 convolutions, which correspond to the blue arrows in the decoding part. At the last layer of the network, the 64-channel feature map is converted into the crack/non-crack prediction result through a 1 × 1 convolution, corresponding to the cyan arrow. The parameters of the U-Net network structure are listed in Table 1.
In this paper, a pavement crack image of size $H \times W \times 3$ is used as the input image of the U-Net network, and three features of the decoder part, with resolutions of $H \times W$, $\frac{H}{2} \times \frac{W}{2}$ and $\frac{H}{4} \times \frac{W}{4}$, are extracted and sent to the morphological network for depolarization processing.
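For concreteness, the following is a minimal PyTorch sketch of how decoder-side features at three resolutions could be extracted from a U-Net-style backbone. It uses only two encoder/decoder levels instead of the four described above, and the module names and channel widths are illustrative assumptions rather than details taken from this paper.

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions, each followed by ReLU, as in each repeating U-Net block.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class UNetFeatures(nn.Module):
    """Reduced two-level U-Net that returns decoder-side features at three
    scales: full resolution, 1/2 and 1/4 (the real network uses four levels)."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        self.enc1 = double_conv(in_ch, base)            # H x W
        self.enc2 = double_conv(base, base * 2)         # H/2 x W/2
        self.bottom = double_conv(base * 2, base * 4)   # H/4 x W/4
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, kernel_size=2, stride=2)
        self.dec2 = double_conv(base * 4, base * 2)
        self.up1 = nn.ConvTranspose2d(base * 2, base, kernel_size=2, stride=2)
        self.dec1 = double_conv(base * 2, base)

    def forward(self, x):
        e1 = self.enc1(x)                                      # H x W
        e2 = self.enc2(self.pool(e1))                          # H/2 x W/2
        b = self.bottom(self.pool(e2))                         # H/4 x W/4
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))    # H/2 x W/2
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))   # H x W
        # Three decoder-side features sent on to the morphological network.
        return d1, d2, b

feats = UNetFeatures()(torch.randn(1, 3, 256, 256))
print([f.shape for f in feats])
```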
3.2. Morphological Network
When the camera takes a picture of a crack, different parts of the same crack may appear with different colors and brightness values due to changes in light and shadow and different shooting angles, which is called a polarized light phenomenon. This phenomenon can lead to the same crack being treated as different objects, thus degrading the segmentation performance. In order to overcome the influence of polarized light on segmentation performance, a morphological network is designed to enhance the output features of the U-Net network so as to highlight bright cracks on a dark background and dark cracks on a bright background.
The morphological network consists of the white top-hat transformation $T_w$ and the black bottom-hat transformation $T_b$:

$$T_w(F) = F - (F \ominus e) \quad (1)$$

$$T_b(F) = (F \oplus d) - F \quad (2)$$

where $F \in \mathbb{R}^{H \times W \times C}$ represents the output feature of the U-Net network, $F \ominus e$ represents the morphological erosion operation on $F$ and $F \oplus d$ represents the morphological dilation operation on $F$. Here, $d$ and $e$ are the dilation and erosion filters, $H \times W$ and $C$ represent the resolution and number of channels of the extracted features, respectively, and $k$ and $K$ are the size and the number of the filters.
For each pixel $(x, y)$, the dilation $\oplus$ and erosion $\ominus$ operations on the feature map $F$ are as follows:

$$(F \oplus d)(x, y) = \max_{(i, j)} \{ F(x - i, y - j) + d(i, j) \} \quad (3)$$

$$(F \ominus e)(x, y) = \min_{(i, j)} \{ F(x + i, y + j) - e(i, j) \} \quad (4)$$

The value ranges of $i$ and $j$ are as follows:

$$-\frac{k-1}{2} \le i \le \frac{k-1}{2} \quad (5)$$

$$-\frac{k-1}{2} \le j \le \frac{k-1}{2} \quad (6)$$
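As a hedged illustration of Equations (3) and (4), the sketch below implements grayscale dilation and erosion with an additive structuring filter using PyTorch's unfold operation. The function names, the zero padding at the image borders and the flat 3 × 3 filter in the usage example are simplifying assumptions, not details taken from this paper.

```python
import torch
import torch.nn.functional as F

def grayscale_dilation(feat, filt):
    """(F ⊕ d)(x, y) = max over (i, j) of F(x - i, y - j) + d(i, j).
    feat: (B, C, H, W); filt: (k, k) structuring filter applied per channel."""
    k, pad = filt.shape[-1], filt.shape[-1] // 2
    # Collect every k x k neighbourhood (zero padding at the borders is a simplification).
    patches = F.unfold(feat, kernel_size=k, padding=pad)      # (B, C*k*k, H*W)
    B, C, H, W = feat.shape
    patches = patches.view(B, C, k * k, H * W)
    # The (x - i, y - j) convention corresponds to a flipped filter over each window.
    out = (patches + filt.flip((-2, -1)).reshape(1, 1, k * k, 1)).max(dim=2).values
    return out.view(B, C, H, W)

def grayscale_erosion(feat, filt):
    """(F ⊖ e)(x, y) = min over (i, j) of F(x + i, y + j) - e(i, j)."""
    k, pad = filt.shape[-1], filt.shape[-1] // 2
    patches = F.unfold(feat, kernel_size=k, padding=pad)
    B, C, H, W = feat.shape
    patches = patches.view(B, C, k * k, H * W)
    out = (patches - filt.reshape(1, 1, k * k, 1)).min(dim=2).values
    return out.view(B, C, H, W)

feat = torch.randn(1, 4, 32, 32)
filt = torch.zeros(3, 3)                                   # flat 3 x 3 structuring element
white_top_hat = feat - grayscale_erosion(feat, filt)       # Equation (1)
black_bottom_hat = grayscale_dilation(feat, filt) - feat   # Equation (2)
```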
Examples of dilation and erosion operations are shown in
Figure 3 and
Figure 4, respectively. The erosion operation enlarges the dark areas in the image, while the dilation operation enlarges the bright areas. Dilation and erosion can eliminate noise, segment independent cracks, connect adjacent elements in the image and find the maximum or minimum areas in the image. Therefore, subtracting the morphologically processed image from the original image (and vice versa) can highlight areas that are brighter or darker than their surroundings, thus correcting the effects of uneven illumination. The structure diagram of the morphological network is shown in
Figure 5, which consists of parallel erosion and dilation processing layers, a subtraction operation and a concatenation operation. The $K$ feature maps of the dilation layer are obtained by applying the $K$ dilation filters $d$ to the output features of the U-Net. In the same way, the $K$ feature maps of the erosion layer are obtained by applying the $K$ erosion filters $e$ to the output features of the U-Net. After the dilation and erosion processing, the white top-hat feature and the black bottom-hat feature are obtained through the difference operation. The difference feature maps and the morphologically processed feature maps are concatenated, weighted by a linear combination and convolved with a 1 × 1 filter to obtain the output feature map of the morphological network.
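To make the data flow of Figure 5 concrete, here is a minimal, hypothetical PyTorch sketch of the morphological branch. For brevity it uses flat (all-zero) structuring elements, so dilation and erosion reduce to sliding-window max/min pooling, whereas the network described above learns the filters $d$ and $e$; the class name, channel count and kernel size are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MorphNet(nn.Module):
    """Parallel dilation/erosion layers, top-hat/bottom-hat differences,
    concatenation and a 1x1 fusion convolution (flat structuring elements)."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.k, self.pad = k, k // 2
        # 1x1 convolution that linearly weights and fuses the concatenated maps.
        self.fuse = nn.Conv2d(4 * channels, channels, kernel_size=1)

    def forward(self, feat):
        dil = F.max_pool2d(feat, self.k, stride=1, padding=self.pad)     # F ⊕ d
        ero = -F.max_pool2d(-feat, self.k, stride=1, padding=self.pad)   # F ⊖ e
        white_top_hat = feat - ero      # highlights bright cracks on a dark background
        black_bottom_hat = dil - feat   # highlights dark cracks on a bright background
        out = torch.cat([dil, ero, white_top_hat, black_bottom_hat], dim=1)
        return self.fuse(out)

enhanced = MorphNet(channels=64)(torch.randn(1, 64, 256, 256))
print(enhanced.shape)  # torch.Size([1, 64, 256, 256])
```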
In order to verify the effectiveness of the morphological network module, the feature maps before and after morphological processing are extracted and compared, as shown in
Figure 6. Without morphological processing, there is a large difference between the extracted crack shape and the ground truth, and there are many falsely detected areas. In contrast, the feature map after morphological processing is much closer to the ground truth. These results verify the effectiveness of the morphological processing.
3.3. Side Network and Loss Function
To obtain the crack prediction results, the output feature of the morphological network at each scale is fed into the side network for channel merging and up-sampling. The specific operations are as follows:
(1) The enhanced feature of size $\frac{H}{4} \times \frac{W}{4}$ is processed by a 1 × 1 convolution to obtain a dimension-reduced feature with a single channel. Through two 2 × 2 up-sampling convolution operations, an up-sampled feature with the same size as the original image is obtained. Then, the prediction map of the first scale is obtained through the activation function, which is recorded as $P_1$.
(2) The prediction result of the second scale is obtained after the enhanced feature of size $\frac{H}{2} \times \frac{W}{2}$ undergoes a 1 × 1 convolution operation, a 2 × 2 up-sampling and activation function processing, and is denoted as $P_2$.
(3) The prediction result of the third scale is obtained from the full-resolution enhanced feature by only activation function processing, and is expressed as $P_3$ with the size of $H \times W$.
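The three steps above can be summarized by the following sketch of a side branch: a 1 × 1 convolution, a scale-dependent number of 2 × 2 up-sampling convolutions, and a sigmoid activation. The class name, channel widths and the 256 × 256 resolution are illustrative assumptions; in particular, the sketch also applies the 1 × 1 channel reduction to the third scale, since its enhanced feature has more than one channel.

```python
import torch
import torch.nn as nn

class SideBranch(nn.Module):
    """One side branch: 1x1 conv to a single channel, `n_up` 2x up-sampling
    steps back to the input resolution, then a sigmoid activation."""
    def __init__(self, channels, n_up):
        super().__init__()
        layers = [nn.Conv2d(channels, 1, kernel_size=1)]
        for _ in range(n_up):
            layers.append(nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2))
        layers.append(nn.Sigmoid())
        self.branch = nn.Sequential(*layers)

    def forward(self, feat):
        return self.branch(feat)

# f1: H/4, f2: H/2, f3: H (channel widths are illustrative)
f1 = torch.randn(1, 256, 64, 64)
f2 = torch.randn(1, 128, 128, 128)
f3 = torch.randn(1, 64, 256, 256)
p1 = SideBranch(256, n_up=2)(f1)   # two 2x2 up-sampling steps
p2 = SideBranch(128, n_up=1)(f2)   # one 2x2 up-sampling step
p3 = SideBranch(64, n_up=0)(f3)    # no up-sampling needed
print(p1.shape, p2.shape, p3.shape)  # each 1 x 1 x 256 x 256
```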
From observation of the crack images, it can be seen that cracks occupy only a small proportion of the image. That is to say, the number of negative samples used for model training is far greater than the number of positive samples. A large number of negative samples will encourage the model to neglect the learning of positive samples, which leads to poor prediction of positive samples and a low F1 value. To solve the sample imbalance problem in model training, the Dice Loss is used to reduce the learning degree of simple negative samples and thus improve the segmentation performance. The Dice Loss is calculated as follows.
Given two sets A and B, their Dice similarity coefficient $S$ is defined as Equation (7), and its value range is [0, 1]:

$$S = \frac{2\,|A \cap B|}{|A| + |B|} \quad (7)$$

where $|A \cap B|$ is the number of elements in the intersection of A and B, and $|A|$ and $|B|$ represent the number of elements in the sets A and B, respectively. In this paper, A and B are the predicted and true positive sample sets, respectively.
TP (True Positive), TN (True Negative), FP (False Positive) and FN (False Negative) are usually used as evaluation indicators in the binary classification problem. Here, $|A \cap B| = TP$, $|A| = TP + FP$ and $|B| = TP + FN$, so the Dice coefficient $S$ is adjusted as Equation (8):

$$S = \frac{2\,TP}{2\,TP + FP + FN} \quad (8)$$

The Dice Loss $L_{Dice}$ is defined as follows:

$$L_{Dice} = 1 - S = 1 - \frac{2\,TP}{2\,TP + FP + FN} \quad (9)$$
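A minimal soft Dice Loss corresponding to Equations (7)–(9) can be written as follows; the small smoothing constant eps is an implementation assumption added to avoid division by zero and is not part of the definition above.

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice Loss: 1 - 2|A ∩ B| / (|A| + |B|), computed on probabilities.
    pred and target have the same shape with values in [0, 1]."""
    pred, target = pred.reshape(-1), target.reshape(-1)
    intersection = (pred * target).sum()   # soft counterpart of TP
    return 1.0 - (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

pred = torch.rand(1, 1, 256, 256)                      # predicted crack probabilities
target = (torch.rand(1, 1, 256, 256) > 0.95).float()   # sparse crack ground truth
print(dice_loss(pred, target))
```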
On the one hand, Dice Loss-based network training directly uses the segmentation evaluation index as the loss function; on the other hand, the large number of background pixels is ignored in the calculation of the ratio between the intersection and the union. Therefore, the Dice Loss function is chosen to solve the problem of imbalanced positive and negative samples in crack images and, at the same time, to improve the convergence speed.
In order to make full use of the multi-scale information of the crack, this paper calculates the loss of the prediction results at each scale and then fuses the loss of the final prediction result to obtain the objective loss function. The design of the multiple loss function is shown in
Figure 7.
The final crack prediction result $P$ is obtained by concatenation and 1 × 1 convolution of the multi-scale prediction results. The objective loss function $L$ is the sum of the losses on all scales and the final prediction loss, and it is used to adjust the training of the network model until it converges to the desired value, thus yielding the network model:

$$L = \sum_{i=1}^{N} L_{Dice}(P_i, G_i) + L_{Dice}(P, G) \quad (10)$$

where $P_i$ and $G_i$ are the predicted result and the ground truth of the $i$-th scale, and $N$ is the total number of scales, which is set to three in the experiment.
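The sketch below illustrates Equation (10) together with the final fusion step: the three scale predictions are concatenated, fused by a 1 × 1 convolution, and the Dice Losses of every scale and of the fused prediction are summed. The tensor shapes and the random ground truth are placeholders for illustration only.

```python
import torch
import torch.nn as nn

def dice_loss(pred, target, eps=1e-6):
    # Soft Dice Loss, as in Equation (9), with a small smoothing constant.
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# Final fusion: concatenate the three scale predictions and apply a 1x1 convolution.
fuse = nn.Sequential(nn.Conv2d(3, 1, kernel_size=1), nn.Sigmoid())

preds = [torch.rand(1, 1, 256, 256, requires_grad=True) for _ in range(3)]  # P1, P2, P3
gt = (torch.rand(1, 1, 256, 256) > 0.95).float()                            # ground truth

p_final = fuse(torch.cat(preds, dim=1))
# Objective loss of Equation (10): per-scale losses plus the final prediction loss.
total_loss = sum(dice_loss(p, gt) for p in preds) + dice_loss(p_final, gt)
total_loss.backward()
print(total_loss.item())
```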
Figure 8 shows the crack image, its prediction results on the three scales, i.e., $P_1$, $P_2$ and $P_3$, and the final prediction result $P$. $P$ is obtained by performing a 1 × 1 convolution on the concatenation of the features of the three scales. As can be seen from
Figure 8, the shallow-scale predictions $P_1$ and $P_2$ pay more attention to the location information of the cracks, but there is a relatively large difference between their prediction results and the ground truth due to the down-sampling operation. $P_3$ is more accurate at the details of the crack image due to the combination of deep and shallow features through cross-layer connections. The final prediction result after fusion is the closest to the ground truth, which proves the effectiveness of the multi-scale prediction fusion.