Article

A Method for Extracting Joints on Mountain Tunnel Faces Based on Mask R-CNN Image Segmentation Algorithm

1 The Key Laboratory of Road and Traffic Engineering, Ministry of Education, Tongji University, Shanghai 201804, China
2 Department of Civil Engineering, Hangzhou City University, Hangzhou 310015, China
3 Key Laboratory of Safe Construction and Intelligent Maintenance for Urban Shield Tunnels of Zhejiang Province, Hangzhou City University, Hangzhou 310015, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(15), 6403; https://doi.org/10.3390/app14156403
Submission received: 17 June 2024 / Revised: 15 July 2024 / Accepted: 19 July 2024 / Published: 23 July 2024
(This article belongs to the Special Issue Advanced Techniques in Tunnelling)

Abstract

Accurate knowledge of the distribution of joints on the tunnel face is crucial for assessing the stability and safety of the surrounding rock during tunnel construction. This paper introduces the Mask R-CNN image segmentation algorithm, a state-of-the-art deep learning model, to achieve efficient and accurate identification and extraction of joints from tunnel face images. First, digital images of tunnel faces were captured and stitched, resulting in 286 complete images suitable for analysis. Then, to address the limited recognition accuracy of existing approaches, the joints on the tunnel face were extracted using traditional image processing algorithms, the commonly used U-Net image segmentation model, and the Mask R-CNN image segmentation model introduced in this paper. Finally, the extraction results of the three methods were compared. The comparison shows that the joint extraction method based on the Mask R-CNN image segmentation deep learning model achieved the best results, with a Dice similarity coefficient of 87.48%, outperforming traditional methods and the U-Net model, which scored 60.59% and 75.36%, respectively, thus realizing accurate and efficient acquisition of tunnel face rock joints. These findings suggest that the Mask R-CNN model can be effectively implemented in real-time monitoring systems for tunnel construction projects.

1. Introduction

The development degree of joints on the tunnel face reflects rock mass integrity, which is crucial for dynamic evaluation during tunnel construction. Currently, joint development descriptions rely on hand-drawn records and qualitative judgments, limiting accuracy and efficiency. Digital image capturing is primarily used for records, yet these images contain valuable rock mass information. If an effective method can be established to digitally extract and obtain complete rock mass information from tunnel face images, it would significantly improve the efficiency of real-time dynamic grading of the surrounding rock on construction sites.
Early methods, such as manual counting, were inefficient and prone to human error. Ross-Brown and Atkinson [1] first used camera images for rock mass characterization, while subsequent studies introduced digital image processing techniques [2] for better accuracy and efficiency. Advancements in image processing, such as Fourier and Hough transforms [3,4], grayscale elevation methods [5], and structural analysis techniques [6], have improved joint extraction but still rely heavily on manual intervention and experience. Recent studies [7,8] developed algorithms to overcome these limitations but faced challenges in complex environments. However, traditional image processing methods heavily rely on experience, and their processing effectiveness still needs improvement.
AI algorithms, especially deep-learning-based object detection methods, have shown promise in civil engineering for recognizing cracks in concrete linings [9,10,11,12,13,14,15,16]. However, applying these methods to tunnel rock faces poses challenges due to larger areas, harsh conditions, and complex backgrounds. Many researchers have conducted studies on this topic. For instance, Liu et al. [17], Chen et al. [18], and Lee et al. [19] introduced modifications to existing networks to enhance performance in pixel-level joint extraction. Recent advancements, such as the Path Aggregation Network (PA-Net) [20], have further improved detection accuracy in dark environments.
Using deep learning neural network models is currently the mainstream research direction for tunnel face joint extraction, but extraction accuracy still needs improvement. Adapting the structures of deep learning convolutional neural networks and deepening the learning for specific engineering geological rock masses are the main avenues for progress. Among the many image segmentation convolutional neural networks, the U-Net network, proposed in 2015 for biomedical image segmentation [21], has been applied by several researchers to crack or joint extraction because of its simple, intuitive architecture, relatively low training data requirements, and ability to generate pixel-level masks [22,23,24,25]. To further improve joint extraction accuracy, this study focuses on the Mask R-CNN model [26]. Mask R-CNN, proposed in 2017, offers high precision in both object detection and segmentation, is well suited to complex tasks, and has been widely adopted in various fields [27,28,29,30,31,32], making it a strong candidate for extracting the complex joints on tunnel faces.
This paper utilizes data from the Henan Luanlu Expressway Nianpan Tunnel and Zhejiang Hangwen Railway Tunnel to compare joint extraction results using traditional methods, the U-net model, and the Mask R-CNN model. We aim to demonstrate the superiority of the Mask R-CNN in achieving accurate and efficient joint extraction.

2. Acquisition of Tunnel Face Images

Field images of rock masses are the basic data for obtaining joint information on the tunnel face. High-quality images enhance the accuracy and speed of joint extraction. This section introduces the image partitioning acquisition method and the stitching and fusion method of partitioned images, aiming to obtain clear and complete field image data.

2.1. Method of Partitioned Image Acquisition

2.1.1. Principles for Selecting Photography Equipment

Due to the dim lighting, severe dust, low visibility near the tunnel face, and the typically large area of the tunnel face, the requirements for photographic equipment are relatively high.
Selecting a camera with a larger frame allows for a wider field of view, and a larger sensor size results in more detailed images. An aperture of f/2.0 or larger is recommended for low-light conditions in tunnel environments. A focal length of 35 mm to 60 mm is appropriate for tunnel projects, balancing visible distance and field of view. Longer exposure times are good for capturing more information about the rock in low-light environments and are recommended to be between 1/30 s and 1/8 s. These settings ensure high image quality in low-light conditions typical of mountain tunnels. The equipment selected for this study meets the above settings, including a Canon EOS 6D MARK II camera with a 50 mm f/1.4 lens and a DSLR tripod (see Figure 1).

2.1.2. Principles for Selecting Light Sources

To improve image quality under low-light conditions, various lighting equipment such as flash, reflector lamps, and mechanical equipment light sources are used. Each has its advantages and limitations.
Flash is the most direct complementary light source for digital cameras. It produces a strong lighting effect at the moment of exposure. However, in the dusty environment of tunnel engineering, it can easily cause diffuse reflection of dust particles, which affects the imaging quality. Reflector lamps are a type of spotlight with a wide lighting range and a stable light source. They provide better lighting for the tunnel working surface but are less portable and require a power supply that may be inconvenient at the construction site. Mechanical equipment at the tunnel construction site, such as loaders, wet spray trucks, and dump trucks, generally has its own lighting system. These light sources have wide coverage and stable illumination, which can provide a good lighting effect for the tunnel working surface. Although mechanical shadows may interfere, such interference can be effectively avoided in practice. Therefore, in this study, the mechanical equipment shown in Figure 2 is selected for lighting and fill light.

2.1.3. Partitioned Shooting Plan and Timing for Tunnel Face Photography

To obtain high-quality images, the camera should be placed 10–20 m in front of the tunnel face, perpendicular to it. The tunnel face is divided into sections to ensure comprehensive coverage.
Various construction activities can obstruct the tunnel face and complicate photography. During drilling and charging, the drilling jumbo and the tunnel-lined platform car can block the view. During mucking, rubble covers the tunnel face, and high dust concentration makes photographing difficult. During the installation of steel arches and shotcrete application, the tunnel-lined platform car can again block the tunnel face, and the shotcrete process reduces visibility, affecting photo quality. Therefore, the optimal times for photography are after mucking and before installing steel arches, avoiding the adverse interferences shown in Figure 3 and ensuring clear visibility.
This article relies on the Luanchuan–Lushi Expressway Tunnel in Henan and the Hangzhou–Wenzhou Railway Tunnel in Zhejiang (shown in Figure 4), where the tunnel face area is generally less than 100 square meters. Considering the onsite shooting conditions, the shooting plan shown in Figure 5 is adopted: the tunnel face is divided into six sections, and the camera is placed 10 m in front of the face. The optimal time for photography is after mucking and before installing steel arches. During this period, uniform lighting can be provided using the tunnel-lined platform car light source, improving the lighting quality for photographing the tunnel face.

2.2. Stitching and Fusion of Partitioned Photography Images

After obtaining the six partitioned photographic images of the tunnel face, stitching them together to form a complete and clear tunnel face image is a prerequisite for the next step of joint extraction.
When taking the images, to ensure complete coverage of each partition, the area covered by each partition image is often slightly larger than the actual corresponding partition area. This inevitably leads to overlapping images in adjacent regions, making it impossible to achieve image stitching through simple positioning. The following example illustrates the stitching and fusion algorithm of partitioned images on the tunnel face, using the right arch foot region and the floor region of a tunnel face as examples.

2.2.1. Image Stitching of Tunnel Work Face Partitions

As shown in Figure 6, the blue part of the floor partition image overlaps with the red part of the right arch foot partition image. This overlapping area needs to be stitched together.
The image stitching process uses the SURF (Speeded-Up Robust Features) algorithm, which performs faster compared to other algorithms [33]. First, the Hessian matrix of the image is calculated according to Equation (1):
H(f(x,y)) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x\,\partial y} \\ \dfrac{\partial^2 f}{\partial x\,\partial y} & \dfrac{\partial^2 f}{\partial y^2} \end{bmatrix}
where H() is the Hessian matrix, and f(x, y) is the color value at the image coordinates (x, y).
Next, the determinant of the Hessian matrix is calculated using Equation (2) to obtain the local extremum points of the pixels, which are used as the SURF feature points of the image.
\det(H) = \dfrac{\partial^2 f}{\partial x^2}\,\dfrac{\partial^2 f}{\partial y^2} - \left(\dfrac{\partial^2 f}{\partial x\,\partial y}\right)^2
where H is the Hessian matrix, and f is the color value at the image coordinates (x, y).
After obtaining the feature points of the reference image and the matching image, the similarity of the feature points is calculated using the Euclidean distance criterion shown in Equation (3):
l = \sqrt{\sum_{i=1}^{n} \left( X_1(i) - X_2(i) \right)^2}
where l is the distance between the two feature points, n is the dimension of the feature descriptors, X_1 is the descriptor vector of the feature point in the reference image, and X_2 is the descriptor vector of the feature point in the matching image.
When the distance is less than the set threshold (found to be optimally between 0.6 and 0.8), the two feature points are considered successfully matched.
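For illustration, a minimal sketch of this SURF-based matching and stitching step using OpenCV is given below. The function name stitch_partitions, the Hessian threshold, and the 0.7 distance threshold are illustrative assumptions (the text only recommends the 0.6–0.8 range), and SURF requires the opencv-contrib-python build.

```python
import cv2
import numpy as np

def stitch_partitions(ref_img, match_img, dist_thresh=0.7):
    """Sketch of SURF-based stitching of two overlapping partition images."""
    # SURF lives in the contrib module (opencv-contrib-python)
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)

    kp1, des1 = surf.detectAndCompute(cv2.cvtColor(ref_img, cv2.COLOR_BGR2GRAY), None)
    kp2, des2 = surf.detectAndCompute(cv2.cvtColor(match_img, cv2.COLOR_BGR2GRAY), None)

    # Match descriptors by Euclidean (L2) distance, Equation (3),
    # and keep pairs whose distance falls below the threshold
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = [m for m in matcher.match(des1, des2) if m.distance < dist_thresh]

    # Estimate the homography that maps the matching image onto the reference image
    src = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

    # Warp the matching image into the reference frame; the overlap is blended later
    h, w = ref_img.shape[:2]
    canvas = cv2.warpPerspective(match_img, H, (2 * w, h))
    canvas[0:h, 0:w] = ref_img
    return canvas
```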

2.2.2. Image Fusion of Tunnel Work Face Partitions

After stitching, unnatural color transitions (shown in Figure 7) are corrected using the fade-in–fade-out weighted fusion algorithm shown in Equation (4).
X(a,b) = \begin{cases} x_1(a,b), & (a,b) \in x_1 \\ (1-\gamma)\,x_1(a,b) + \gamma\,x_2(a,b), & (a,b) \in x_1 \cap x_2 \\ x_2(a,b), & (a,b) \in x_2 \end{cases}
where x_1 and x_2 are the images to be stitched, X is the stitched image, and γ = w_d / w ∈ (0, 1) is the weighting factor, in which w is the horizontal width of the overlapping part of the stitched images and w_d is the horizontal distance of a pixel in the overlapping part from the start of the overlapping section.
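A minimal NumPy sketch of this fade-in–fade-out weighting over a horizontal overlap is given below; it assumes the two images are already aligned in a common frame and that the overlap spans the column range [x_start, x_end), both of which are illustrative assumptions.

```python
import numpy as np

def blend_overlap(img1, img2, x_start, x_end):
    """Fade-in-fade-out weighted fusion of Equation (4) over a horizontal overlap."""
    fused = img1.astype(np.float32)
    w = x_end - x_start                          # width of the overlapping part
    for col in range(x_start, x_end):
        wd = col - x_start                       # distance from the start of the overlap
        gamma = wd / w                           # weighting factor in (0, 1)
        fused[:, col] = (1.0 - gamma) * img1[:, col] + gamma * img2[:, col]
    fused[:, x_end:] = img2[:, x_end:]           # right of the overlap comes from img2
    return fused.astype(img1.dtype)
```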
The final complete tunnel work face fusion image compared with the full tunnel work face captured image is shown in Figure 8.
Stitched and fused images significantly improve quality and restore geological structure information, laying a solid foundation for subsequent joint extraction (see Figure 8).

3. Joint Extraction from Tunnel Face Based on Traditional Image Processing Methods

This section employs traditional computer image processing methods for joint extraction from tunnel face images. The main process includes grayscale processing, spatial filtering, image binarization, morphological processing, noise removal, and finally, outputting the joint extraction image of the tunnel face.
The following demonstrates the image processing procedure using the complete tunnel face image obtained through stitching and fusion in Section 2.2, as shown in Figure 8b.

3.1. Grayscale Processing

Grayscale processing reduces image dimensions, facilitating feature extraction by converting RGB images to grayscale using Equation (5). The result is shown in Figure 9.
Gray = (R + G + B) / 3
where Gray is the calculated grayscale value of the pixel, R is the red component value of the pixel, G is the green component value of the pixel, and B is the blue component value of the pixel.
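A one-line NumPy equivalent of Equation (5) is sketched below for reference; the function name is illustrative.

```python
import numpy as np

def to_gray_mean(rgb):
    """Mean-value grayscale conversion, Gray = (R + G + B) / 3 (Equation (5))."""
    return rgb.astype(np.float32).mean(axis=2).round().astype(np.uint8)
```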

3.2. Spatial Filtering

Spatial filtering, particularly bilateral filtering (as shown in Equation (6)), enhances image quality by preserving edges while removing noise.
g(x,y) = \dfrac{\sum_{k,l} f(k,l)\,\omega(x,y,k,l)}{\sum_{k,l} \omega(x,y,k,l)}
where f(k,l) is the pixel value in the neighborhood centered at point (x, y), and ω(x,y,k,l) is the weighting coefficient for the neighboring pixel (k, l) centered at point (x, y). This coefficient is determined by the product of the spatial kernel and the range kernel, with the expression given by Equation (7):
\omega(x,y,k,l) = \exp\left[ -\dfrac{(x-k)^2 + (y-l)^2}{2\sigma_d^2} - \dfrac{\lVert f(x,y) - f(k,l) \rVert^2}{2\sigma_r^2} \right]
where σ_d is the filter radius of the spatial domain kernel, and σ_r is the filter radius of the range domain kernel.
The effect of the tunnel face image after bilateral filtering is shown in Figure 10.
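OpenCV provides a direct implementation of this filter; a minimal sketch follows, where the neighborhood diameter and the two sigma values are illustrative and would need tuning for real tunnel face images.

```python
import cv2

# Bilateral filtering (Equations (6) and (7)): smooths noise while preserving joint edges.
# sigmaSpace plays the role of the spatial-domain kernel and sigmaColor that of the range kernel.
gray = cv2.imread("tunnel_face_gray.png", cv2.IMREAD_GRAYSCALE)
filtered = cv2.bilateralFilter(gray, d=9, sigmaColor=75, sigmaSpace=75)
cv2.imwrite("tunnel_face_filtered.png", filtered)
```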

3.3. Image Binarization

To separate the tunnel face information from the background, image binarization is necessary. Common binarization methods include the histogram bimodal method, the maximum entropy method, the minimum error method, and the OTSU [34] method.
The OTSU method (Otsu’s thresholding method) is a global adaptive segmentation algorithm that uses image grayscale to divide the image into foreground and background. The maximum between-class variance K is calculated as shown in Equation (8).
K = P_o(\mu - \mu_o)^2 + P_b(\mu - \mu_b)^2
where μ is the mean grayscale value of the image, μ_o is the mean grayscale value of the foreground, μ_b is the mean grayscale value of the background, P_o is the proportion of the foreground in the entire image, and P_b is the proportion of the background in the entire image.
The OTSU method is chosen for its efficiency and global adaptive thresholding capabilities, making it suitable for tunnel face image segmentation. The binarization segmentation effect is shown in Figure 11.
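A minimal OpenCV sketch of OTSU thresholding follows; it assumes the bilaterally filtered grayscale image from the previous step and inverts the output so that the darker joint pixels become the white foreground, which is an illustrative convention rather than a detail stated in the text.

```python
import cv2

# OTSU thresholding (Equation (8)): the threshold that maximizes the between-class
# variance is chosen automatically, so the threshold argument of 0 is ignored.
filtered = cv2.imread("tunnel_face_filtered.png", cv2.IMREAD_GRAYSCALE)
otsu_t, binary = cv2.threshold(filtered, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
print("OTSU threshold selected:", otsu_t)
cv2.imwrite("tunnel_face_binary.png", binary)
```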

3.4. Morphological Processing

Due to the presence of filling materials in the joints and the effects of lighting, pixels that originally belonged to the same joint may appear as disconnected points, as shown in Figure 12a, making it look like multiple joints.
To address the missing pixels shown in Figure 12a, morphological processing is performed, as illustrated in Figure 13. The principle involves the following:
(1) Dilation operation: applying a dilation operation to the two joints in Figure 13a results in the connected joint image shown in Figure 13b.
(2) Erosion operation: performing an erosion operation removes the dilated pixels from non-breakpoint areas while retaining the breakpoint pixels, resulting in Figure 13c.
(3) Merging pixels: finally, merging the new breakpoint pixels with the original joint pixels forms the new joint shown in Figure 13d.
After applying morphological processing to the joints with breakpoints shown in Figure 12a, the result is as shown in Figure 12b. It can be seen that morphological processing connects pixels in joints by applying dilation and erosion operations, effectively addressing disconnected points due to lighting or filling materials.
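The dilation-then-erosion sequence described above is a morphological closing; a minimal OpenCV sketch is shown below, with the 5 × 5 structuring element and the iteration count as illustrative assumptions.

```python
import cv2
import numpy as np

# Morphological closing = dilation followed by erosion, reconnecting joint pixels
# that were broken apart by filling material or uneven lighting.
binary = cv2.imread("tunnel_face_binary.png", cv2.IMREAD_GRAYSCALE)
kernel = np.ones((5, 5), np.uint8)        # structuring element (size is illustrative)
closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=2)
cv2.imwrite("tunnel_face_closed.png", closed)
```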

3.5. Noise Removal

As shown in Figure 14a, after morphological processing of the joints, a large number of noise pixels and significant pixel interference from the surrounding rock still exist in the image. Noise removal is required to address these issues.
Noise removal involves eliminating large surrounding rock areas (as shown in Figure 14b), small noise points (as shown in Figure 14c), and non-joint areas (as shown in Figure 15) through region-growing algorithms and geometric shape analysis.
After removing the non-joint areas and importing the tunnel contour curve, the final recorded structure of the tunnel face is obtained, as shown in Figure 14d.
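As a simplified stand-in for the region-growing and geometric shape analysis described above, the sketch below removes connected components that are too small (noise points), too large (surrounding rock areas), or not elongated enough to be joint-like; the area and elongation thresholds are illustrative assumptions.

```python
import cv2
import numpy as np

def remove_non_joints(mask, min_area=200, max_area=50000, min_elongation=3.0):
    """Keep only elongated, joint-like connected components (illustrative thresholds)."""
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    joints = np.zeros_like(mask)
    for i in range(1, n):                            # label 0 is the background
        x, y, w, h, area = stats[i]
        elongation = max(w, h) / max(min(w, h), 1)   # rough aspect-ratio measure
        if min_area <= area <= max_area and elongation >= min_elongation:
            joints[labels == i] = 255
    return joints
```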

4. Joint Extraction on Tunnel Faces Based on Image Segmentation Neural Network Models

As can be seen from the tunnel face structure catalog obtained in Section 3, traditional image processing methods for extracting joints are generally ineffective, involve substantial manual intervention, have a complex processing workflow, and result in some loss of joint information. This makes it difficult to meet the requirements for quick and accurate identification of joints in mountain tunnel engineering faces. To address this, recent image segmentation algorithms have been introduced to achieve more intelligent and accurate extraction of face joints. In this section, based on digital image samples of the face obtained through onsite shooting and stitching, the U-Net convolutional neural network algorithm and the Mask R-CNN convolutional neural network algorithm are employed for learning and recognition extraction. The extraction results are then analyzed and compared.

4.1. Data Collection, Annotation, and Augmentation

4.1.1. Onsite Data Collection

In image recognition, the dataset is the foundation for training and evaluation, and selecting an appropriate dataset is crucial for the algorithm’s performance and accuracy. The onsite tunnel face images were collected as described in Section 2.1, and the partitioned digital images were stitched and fused using the algorithm described in Section 2.2. In total, 1716 partitioned photographs were collected and stitched into 286 complete tunnel face images.

4.1.2. Data Annotation

Data annotation with the interactive segmentation annotation software EISeg 1.1.1 (Efficient Interactive Segmentation 1.1.1) (shown in Figure 16) is crucial for accurately marking joint areas and facilitating precise model training. The effect of segmentation is shown in Figure 17.

4.1.3. Dataset Augmentation

Convolutional neural network learning requires a large number of image samples. To increase the number of image samples, data augmentation operations such as left–right flipping, up–down flipping, rotation, and translation were performed on the annotated images. A total of 8580 annotated images were obtained. The augmented image samples are shown in Figure 18.
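A minimal OpenCV/NumPy sketch of these augmentation operations (left–right flip, up–down flip, rotation, and translation) is given below; the rotation angle and translation offsets are illustrative. The same transform must be applied to the annotation mask so that image and label stay aligned.

```python
import cv2
import numpy as np

def augment(image, mask):
    """Generate augmented (image, mask) pairs; each transform is applied to both."""
    h, w = image.shape[:2]
    pairs = [
        (cv2.flip(image, 1), cv2.flip(mask, 1)),   # left-right flip
        (cv2.flip(image, 0), cv2.flip(mask, 0)),   # up-down flip
    ]
    # Rotation about the image center (angle is illustrative)
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), 15, 1.0)
    pairs.append((cv2.warpAffine(image, rot, (w, h)), cv2.warpAffine(mask, rot, (w, h))))
    # Translation by (30, 20) pixels (offsets are illustrative)
    shift = np.float32([[1, 0, 30], [0, 1, 20]])
    pairs.append((cv2.warpAffine(image, shift, (w, h)), cv2.warpAffine(mask, shift, (w, h))))
    return pairs
```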

4.2. Joint Extraction of Tunnel Face Based on U-Net Deep Learning Architecture

4.2.1. U-Net Convolutional Neural Network Architecture

The U-Net convolutional neural network was proposed in 2015 [21] and has achieved good results in the field of medical image cell segmentation. The U-Net network, suitable for joint extraction, classifies all pixels in an image. Its convolutional neural network structure is shown in Figure 19.

4.2.2. U-Net Convolutional Neural Network Parameter Selection

The input image size of this convolutional neural network is 512 × 512. After four down-sampling and four up-sampling processes, the output image size remains 512 × 512, the same as the input. The U-Net uses a 3 × 3 convolution kernel, ReLU activation function, and 2 × 2 max pooling for down-sampling.
(1) Convolution layer parameter selection
The convolution kernel size is 3 × 3, and its convolution processing principle is shown in Figure 20. The sliding step of the convolution kernel is 1. To ensure that the image size after convolution remains consistent with the original image, the original image needs to be padded with a value of 0. The number of output image channels depends on the number of convolution kernels in the convolution layer.
(2) Activation function selection
In the U-Net convolutional network structure shown in Figure 19, the convolution layers indicated by blue arrows correspond to the Rectified Linear Unit (ReLU) activation function shown in Equation (9). Its corresponding function graph is shown in Figure 21.
f(x) = \begin{cases} 0, & x \le 0 \\ x, & x > 0 \end{cases}
The ReLU function is chosen for several reasons. Firstly, it introduces the non-linearity the network needs to learn complex patterns. Secondly, its simple mathematical operation improves computational speed. Lastly, unlike the sigmoid function, ReLU largely avoids the vanishing gradient problem, which occurs when the gradients used to update network weights shrink toward zero and training becomes ineffective; because ReLU passes gradients through without significant attenuation for positive activations, it is particularly suitable for large-scale convolution operations.
(3) Pooling method selection
Pooling is a down-sampling method that can reduce the image size and help prevent overfitting. There are two main types of pooling: max pooling and average pooling.
The principle of max pooling is shown in Figure 22, where the maximum pixel value within the neighborhood is taken as the center pixel value.
The principle of average pooling is shown in Figure 23, where the average pixel value within the neighborhood is taken as the center pixel value.
This paper introduces the U-Net network to identify the structural information of the tunnel face. To maximize the distinction between structural information and background information, a 2 × 2 max pooling method is used for image down-sampling.
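A minimal PyTorch sketch of the repeating encoder unit implied by these choices (two 3 × 3 convolutions with zero padding, ReLU activations, and 2 × 2 max pooling) is shown below; it is a schematic of a single down-sampling stage, not the full U-Net configuration used in this study.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """One U-Net encoder stage: (3x3 conv + ReLU) x 2, then 2x2 max pooling."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),  # zero padding keeps size
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(kernel_size=2)  # halves the spatial resolution

    def forward(self, x):
        features = self.convs(x)      # kept for the skip connection to the decoder
        return self.pool(features), features

# A 512 x 512 single-channel input yields 256 x 256 pooled features.
pooled, skip = DownBlock(1, 64)(torch.randn(1, 1, 512, 512))
```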

4.2.3. Analysis of U-Net Convolutional Neural Network for Tunnel Face Joint Extraction

The 8580 sample images were divided into training, validation, and test sets in a 60%, 20%, and 20% ratio. The preprocessed dataset was input into the U-Net convolutional neural network written in PyTorch using Python 3.7 for training. By calculating the loss and accuracy, the network parameters were iteratively updated to minimize the loss on the validation set. Once the minimum loss value stabilizes, the model converges.
Under this U-Net convolutional neural network structure, the changes in the loss function and accuracy for the training and validation sets are shown in Figure 24 and Figure 25, respectively.
As shown in Figure 24 and Figure 25, the U-Net achieved an accuracy of 82.2% on the training set and 82.6% on the validation set, stabilizing at epoch 29.
The trained U-Net convolutional neural network was used to test the test set. A comparison of randomly selected predicted images and their corresponding labeled images is shown in Figure 26.
As seen in Figure 26, although the U-Net achieved generally good segmentation, it struggled with rough edges and smaller targets.
Comparing the predicted groups, it is evident that when the segmented target occupies a larger proportion of the total image, the overall segmentation effect is better. However, in the fourth and fifth groups, where the segmented target occupies a smaller proportion of the total image, non-target images appear after segmentation. This issue is related to the principle of U-Net, which calculates classification loss pixel by pixel. When the target segmentation object occupies a small portion of the entire image, the iterative loss value can easily drop very low, making it difficult for the target to be fully segmented.
To address the shortcomings of semantic segmentation methods such as U-Net, the authors use Mask R-CNN, an instance segmentation algorithm that combines object detection and semantic segmentation, to extract joints. This approach segments object edges precisely within the bounding boxes produced by object detection, achieving more accurate segmentation results.

4.3. Joint Extraction of Tunnel Face Based on Mask R-CNN Deep Learning Architecture

4.3.1. Mask R-CNN Convolutional Neural Network Architecture

The Mask R-CNN convolutional neural network [26] was proposed by He et al. in 2017. It adds an FCN (Fully Convolutional Network) branch to the Faster R-CNN network, achieving precise segmentation while detecting objects. Its network architecture is shown in Figure 27.
As shown in Figure 27, the Mask R-CNN combines object detection and instance segmentation, using ResNet and FPN for feature extraction and ROIAlign for accurate pooling.

4.3.2. Mask R-CNN Convolutional Neural Network Parameter Selection

The input image size for this convolutional neural network is 512 × 512. After feature map generation, region proposal, and region extraction, the final output is an image of size 512 × 512 with a mask overlay, class labels, and target region positions. This study uses ResNet101 + FPN for the backbone and ROIAlign for pooling, enhancing small object recognition accuracy.
(1) Backbone architecture parameter selection
The backbone architecture of the Mask R-CNN convolutional neural network consists of ResNet + FPN. The commonly used configurations are ResNet50 + FPN and ResNet101 + FPN. The network structures of ResNet50 and ResNet101 are compared in Table 1.
As shown in Table 1, the ResNet101 network has a deeper structure compared to ResNet50, allowing for more precise extraction of image details. Therefore, this study uses ResNet101 + FPN as the backbone architecture for the research on tunnel face joint extraction.
(2) Anchor calibration rules in RPN
Rather than simply dividing anchors into positive and negative samples based on IoU values, this study uses the Non-Maximum Suppression (NMS) [35] method for iterative calculation, following the principle in Equation (10). First, the box with the highest score is selected, and the IoU of every other box with this box is calculated. If the IoU is greater than 0.6, that box is suppressed (its score is set to zero). Then, the box with the next highest score is selected for the next iteration, and the process continues until all boxes have been handled.
s_i = \begin{cases} s_i, & \mathrm{IoU}(M, b_i) < N_t \\ 0, & \mathrm{IoU}(M, b_i) \ge N_t \end{cases}
where s_i is the score of the i-th proposal box, M is the proposal box with the highest confidence in the current iteration, b_i is the i-th remaining proposal box, and N_t is the IoU threshold, set to 0.6 here.
(3) ROI pooling method selection
For networks based on the R-CNN architecture, ROI pooling methods mainly include ROIPooling and ROIAlign. Both aim to map features to a fixed-size feature map. The difference lies in that ROIPooling rounds off pixel values during pooling, while ROIAlign retains the floating-point values of pixels using bilinear interpolation, as shown in Figure 28. ROIAlign provides higher accuracy for small object recognition. To obtain detailed images of tunnel face joints, this study uses the ROIAlign pooling method.
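A minimal torchvision sketch of assembling a Mask R-CNN with a ResNet-101 + FPN backbone is given below; torchvision's MaskRCNN pools region features with MultiScaleRoIAlign (ROIAlign with bilinear interpolation) by default, which matches the pooling choice described here. This is an illustrative configuration rather than the exact training setup used in the study, and the backbone keyword (weights vs. the older pretrained) depends on the torchvision version.

```python
import torch
from torchvision.models.detection import MaskRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# ResNet-101 + FPN backbone; two classes: background and joint.
backbone = resnet_fpn_backbone(backbone_name="resnet101", weights=None)
model = MaskRCNN(backbone, num_classes=2)

# Torchvision's Mask R-CNN already uses MultiScaleRoIAlign for the box and mask
# heads, so ROIAlign pooling needs no extra configuration here.
model.eval()
with torch.no_grad():
    prediction = model([torch.randn(3, 512, 512)])[0]
print(prediction["boxes"].shape, prediction["masks"].shape)
```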

4.3.3. Analysis of Mask R-CNN Convolutional Neural Network for Tunnel Face Joint Extraction

In this experiment, the dataset is the same as in Section 4.1.3. The 8580 sample images of size 512 × 512 are divided into training and test sets in an 80% to 20% ratio. The preprocessed dataset is input into the Mask R-CNN convolutional neural network for training. The initial learning rate is set to 0.001, and the maximum number of iterations (max epochs) is set to 200. The classification loss (loss_cls), localization loss (loss_bbox), segmentation loss (loss_mask), and total loss are calculated as shown in Equation (11).
loss = loss_{\mathrm{cls}} + loss_{\mathrm{bbox}} + loss_{\mathrm{mask}}
When the loss function reaches its minimum value and stabilizes, the model converges. The changes in the loss functions are shown in Figure 29.
As shown in Figure 29, the Mask R-CNN achieved stable loss values at epoch 35, with localization loss lower than classification and segmentation losses. The trained Mask R-CNN convolutional neural network is then used to test the test set. The comparison of the five groups of predicted images with their corresponding labeled images, as in Section 4.2.3, is shown in Figure 30.
As shown in Figure 30, after classification, bounding box selection, and mask calculation, the Mask R-CNN network prediction results achieve good joint segmentation effects compared to the annotated results of the original images. Additionally, comparing the prediction results of the U-Net network shows that the joint segmentation effect is not affected by the proportion of the segmentation target in the image. Both the overall and local details are accurately segmented.

4.4. Comparison of Tunnel Face Joint Recognition Effect and Acquisition of Joint Morphology Parameters

Figure 31 presents the prediction results of five test sample images after traditional image processing, the U-Net convolutional neural network, and the Mask R-CNN convolutional neural network.
From Figure 31, it is evident that overall, all three image segmentation methods achieve certain joint segmentation effects. Specifically, the Mask R-CNN convolutional neural network demonstrates the best segmentation results, followed by the U-Net convolutional neural network, and traditional image processing shows the least effective results. To further quantitatively compare the segmentation effectiveness of these three methods, appropriate metrics will be selected for subsequent comparative analysis.

4.4.1. Evaluation Metrics

(1) Dice similarity coefficient
The Dice coefficient is used to quantify the overlap between the predicted image and the annotated image, calculated as shown in Equation (12).
Dice = 2TP/(2TP + FN + FP)
where TP represents true positives, which are predicted positive samples that are indeed positive; FN represents false negatives, which are predicted negative samples that are indeed positive; and FP represents false positives, which are predicted positive samples that are indeed negative.
(2) Precision
Precision represents the proportion of predicted positive samples that are actually positive, calculated as shown in Equation (13).
Precision = TP/(TP + FP)
(3) Recall
Recall represents the proportion of actual positive samples that are predicted correctly, calculated as shown in Equation (14).
Recall = TP/(TP + FN)
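A minimal NumPy sketch of computing these three metrics from a predicted binary mask and its annotated ground truth follows; it assumes Boolean arrays of the same shape and does not guard against empty masks.

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Dice, Precision, and Recall (Equations (12)-(14)) for Boolean masks."""
    tp = np.logical_and(pred, truth).sum()    # predicted joint, actually joint
    fp = np.logical_and(pred, ~truth).sum()   # predicted joint, actually background
    fn = np.logical_and(~pred, truth).sum()   # joint pixels that were missed
    dice = 2 * tp / (2 * tp + fn + fp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return dice, precision, recall
```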

4.4.2. Comparison of Recognition Effects

Table 2 evaluates the segmentation effectiveness of three segmentation methods—traditional image processing, U-Net convolutional neural network, and Mask R-CNN convolutional neural network—using the metrics described in Section 4.4.1.
As shown in Table 2, the Dice similarity coefficient, Precision, and Recall of traditional image processing are 60.59%, 57.31%, and 70.03%, respectively. For the U-Net network, the Dice similarity coefficient, Precision, and Recall are 75.36%, 70.85%, and 85.58%, respectively. For the Mask R-CNN network, these values are 87.48%, 89.74%, and 84.73%, respectively.
Among the three metrics, the Dice similarity coefficient most accurately reflects the true effect of target segmentation. Although the Recall value of U-Net is slightly higher than that of Mask R-CNN, its Precision is significantly lower, indicating that the U-Net network’s segmentation results are rougher and contain more non-target information points. Comprehensive comparison and analysis show that the Mask R-CNN network has the best segmentation effect on the face joints of the tunnel.

5. Conclusions

Based on the digital images of the tunnel face obtained through sectional shooting, this paper obtained complete and clear images of the tunnel face through image stitching and fusion algorithms. Then, the tunnel face joint information was extracted using three methods: traditional image processing, U-Net convolutional neural network, and Mask R-CNN convolutional neural network. The extraction effects were compared, and the main conclusions are as follows:
(1) Using the SURF algorithm and weighted fusion, sectional images of the tunnel face were stitched into complete, high-clarity images suitable for deep learning algorithms.
(2) Traditional image processing methods, including grayscale processing, spatial filtering, binarization, morphological processing, and noise removal, produced suboptimal results with a Dice similarity coefficient of 60.59%. These methods are inefficient, involve significant manual intervention, and lose joint information, making them unsuitable for tunnel engineering applications.
(3) The U-Net convolutional neural network achieved relatively good segmentation results with a Dice similarity coefficient of 75.36%. However, it lacked precision and lost target details, indicating room for improvement.
(4) The Mask R-CNN model excelled in both overall and detailed segmentation, achieving a Dice similarity coefficient of 87.48%. This model demonstrated efficient and accurate extraction of tunnel face joints, outperforming traditional and U-Net methods.

Author Contributions

Data curation, Y.L.; formal analysis, H.Q.; funding acquisition, X.Y.; investigation, H.Q.; methodology, H.Q.; project administration, X.Y.; resources, Y.L.; software, H.Q. and Z.L.; supervision, J.Z.; validation, H.Q.; visualization, Z.G.; writing—original draft, H.Q.; writing—review & editing, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study did not require ethical approval.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ross-Brown, D.M.; Atkinson, K. Terrestrial photogrammetry in open-pits: 1-description and use of the Phototheodolite in mine surveying. Inst. Min. Metall. 1972, 81, 7–11. [Google Scholar]
  2. Huang, S.L.; Speck, R.C. Digital image processing for rock joint surface studies. Photogramm. Eng. Remote Sens. 1988, 54, 395–400. [Google Scholar]
  3. Krishnan, R.; Sommer, H.J. Estimation of Rock Face Stability; The Pennsylvania State University: University Park, PA, USA, 1994. [Google Scholar]
  4. Fitton, N.; Cox, S. Optimising the application of the Hough transform for automatic feature extraction from geoscientific images. Comput. Geosci. 1998, 24, 933–951. [Google Scholar] [CrossRef]
  5. Reid, T.R.; Harrison, J.P. A semi-automated methodology for discontinuity trace detection in digital images of rock mass exposures. Int. J. Rock Mech. Min. Sci. 2000, 37, 1–5. [Google Scholar] [CrossRef]
  6. Holden, E.-J.; Dentith, M.; Kovesi, P. Towards the automated analysis of regional aeromagnetic data to identify regions prospective for gold deposits. Comput. Geosci. 2008, 34, 1505–1513. [Google Scholar] [CrossRef]
  7. Liu, C.; Wang, B.; Shi, B.; Tang, C. Analytic method of morphological parameters of cracks for rock and soil based on image processing and recognition. Chin. J. Geotech. Eng. 2008, 30, 1383–1388. [Google Scholar]
  8. Chen, B.; Wang, Y.; Wang, H.; Zhu, C.; Fu, J. Identification of tunnel surrounding rock joint and fracture based on SLIC super pixel segmentation and combination. J. Highw. Transp. Res. Dev. 2022, 39, 139–146. [Google Scholar]
  9. Jung, S.Y.; Lee, S.K.; Park, C.I.; Cho, S.Y.; Yu, J.H. A method for detecting concrete cracks using deep-learning and image processing. J. Archit. Inst. Korea Struct. Constr. 2019, 35, 163–170. [Google Scholar]
  10. Bhowmick, S.; Nagarajaiah, S.; Veeraraghavan, A. Vision and deep learning-based algorithms to detect and quantify cracks on concrete surfaces from UAV videos. Sensors 2020, 20, 6299. [Google Scholar] [CrossRef] [PubMed]
  11. Yu, Y.; Rashidi, M.; Samali, B.; Yousefi, A.M.; Wang, W. Multi-image-feature-based hierarchical concrete crack identification framework using optimized SVM multi-classifiers and D-S fusion algorithm for bridge structures. Remote Sens. 2021, 13, 240. [Google Scholar] [CrossRef]
  12. Zhao, S.; Zhang, D.; Xue, Y.; Zhou, M.; Huang, H. A deep learning-based approach for refined crack evaluation from shield tunnel lining images. Autom. Constr. 2021, 132, 103934. [Google Scholar] [CrossRef]
  13. Dang, L.M.; Wang, H.; Li, Y.; Park, Y.; Oh, C.; Nguyen, T.N.; Moon, H. Automatic tunnel lining crack evaluation and measurement using deep learning. Tunn. Undergr. Space Technol. 2022, 124, 104472. [Google Scholar] [CrossRef]
  14. Zhou, Z.; Zhang, J.; Gong, C. Hybrid semantic segmentation for tunnel lining cracks based on Swin Transformer and convolutional neural network. Comput.-Aided Civ. Infrastruct. Eng. 2023, 38, 2491–2510. [Google Scholar] [CrossRef]
  15. Song, F.; Liu, B.; Yuan, G.X. Pixel-level crack identification for bridge concrete structures using unmanned aerial vehicle photography and deep learning. Struct. Control. Health Monit. 2024, 2024, 1299095. [Google Scholar] [CrossRef]
  16. Wang, F.; Chen, T.; Gai, M. A dual-tree-complex wavelet transform-based infrared and visible image fusion technique and its application in tunnel crack detection. Appl. Sci. 2024, 14, 114. [Google Scholar] [CrossRef]
  17. Liu, H.X.; Li, W.S.; Zha, Z.Y.; Jiang, W.J.; Xu, T. Method for surrounding rock mass classification of highway tunnels based on deep learning technology. Chin. J. Geotech. Eng. 2018, 40, 1809–1817. [Google Scholar]
  18. Chen, J.; Zhou, M.; Huang, H.; Zhang, D.; Peng, Z. Automated extraction and evaluation of fracture trace maps from rock tunnel face images via deep learning. Int. J. Rock Mech. Min. Sci. 2021, 142, 104745. [Google Scholar] [CrossRef]
  19. Lee, Y.-K.; Kim, J.; Choi, C.-S.; Song, J.-J. Semi-automatic calculation of joint trace length from digital images based on deep learning and data structuring techniques. Int. J. Rock Mech. Min. Sci. 2022, 149, 104981. [Google Scholar] [CrossRef]
  20. Peng, L.; Wang, H.; Zhou, C.; Hu, F.; Tian, X.; Hongtai, Z. Research on intelligent detection and segmentation of rock joints based on deep learning. Adv. Civ. Eng. 2024, 2024, 8810092. [Google Scholar] [CrossRef]
  21. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), Munich, Germany, 5–9 October 2015. [Google Scholar]
  22. Li, G.; Ma, B.; He, S.; Ren, X.; Liu, Q. Automatic tunnel crack detection based on U-Net and a convolutional neural network with alternately updated clique. Sensors 2020, 20, 717. [Google Scholar] [CrossRef]
  23. Chang, H.; Rao, Z.; Zhao, Y.; Li, Y. Research on tunnel crack segmentation algorithm based on improved U-Net network. Comput. Eng. Appl. 2021, 57, 215–222. [Google Scholar]
  24. Zhao, S.; Zhang, G.; Zhang, D.; Tan, D.; Huang, H. A hybrid attention deep learning network for refined segmentation of cracks from shield tunnel lining images. J. Rock Mech. Geotech. Eng. 2023, 15, 3105–3117. [Google Scholar] [CrossRef]
  25. Shi, Y.; Ballesio, M.; Johansen, K.; Trentman, D.; Huang, Y.; McCabe, M.F.; Bruhn, R.; Schuster, G. Semi-universal geo-crack detection by machine learning. Front. Earth Sci. 2023, 11, 1073211. [Google Scholar] [CrossRef]
  26. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 16th IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  27. Lin, Z.; Ji, K.F.; Leng, X.G.; Kuang, G. Squeeze and excitation rank faster R-CNN for ship detection in SAR images. IEEE Geosci. Remote Sens. Lett. 2019, 16, 751–755. [Google Scholar] [CrossRef]
  28. Yu, Y.; Zhang, K.L.; Yang, L.; Zhang, D. Fruit detection for strawberry harvesting robot in non-structural environment based on Mask-RCNN. Comput. Electron. Agric. 2019, 163, 104846. [Google Scholar] [CrossRef]
  29. Jia, W.; Tian, Y.; Luo, R.; Zhang, Z.; Lian, J.; Zheng, Y. Detection and segmentation of overlapped fruits based on optimized mask R-CNN application in apple harvesting robot. Comput. Electron. Agric. 2020, 172, 105380. [Google Scholar] [CrossRef]
  30. Hao, Z.; Lin, L.; Post, C.J.; Mikhailova, E.A.; Li, M.; Chen, Y.; Yu, K.; Liu, J. Automated tree-crown and height detection in a young forest plantation using mask region-based convolutional neural network (Mask R-CNN). ISPRS J. Photogramm. Remote Sens. 2021, 178, 112–123. [Google Scholar] [CrossRef]
  31. Xu, X.Y.; Zhao, M.; Shi, P.X.; Ren, R.; He, X.; Wei, X.; Yang, H. Crack detection and comparison study based on faster R-CNN and mask R-CNN. Sensors 2022, 22, 1215. [Google Scholar] [CrossRef] [PubMed]
  32. Qin, J.; Zhang, Y.; Zhou, H.; Yu, F.; Sun, B.; Wang, Q. Protein crystal instance segmentation based on Mask R-CNN. Crystals 2021, 11, 157. [Google Scholar] [CrossRef]
  33. Bay, H.; Tuytelaars, T.; van Gool, L. SURF: Speeded up Robust Features; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
  34. Otsu, N. Threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 1979, 9, 62–66. [Google Scholar] [CrossRef]
  35. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
Figure 1. Image acquisition equipment.
Figure 2. Tunnel-lined platform car light source.
Figure 3. Adverse interferences in tunnel face photography. (The red boxes mark where the tunnel face is obscured.)
Figure 4. Map of tunnel locations.
Figure 5. Onsite digital image shooting plan. (Each number corresponds to one partition of the tunnel face.)
Figure 6. Image overlapping area.
Figure 7. Unnatural edges in image stitching. (As marked by the yellow box.)
Figure 8. Comparison of the effects between stitched and fused images and full cross-section photographed images.
Figure 9. Grayscale processing result of tunnel face image.
Figure 10. Bilateral filtering effect on tunnel face image.
Figure 11. Binarized image.
Figure 12. Comparison of joints before and after morphological processing.
Figure 13. Schematic diagram of morphological processing.
Figure 14. Image noise removal process.
Figure 15. Comparison of non-joint and joint areas.
Figure 16. Main interface view of EISeg annotation software.
Figure 17. Using EISeg software for joint data annotation.
Figure 18. Dataset augmentation operations. (The orange line was added to indicate the orientation of each image.)
Figure 19. U-Net convolutional neural network architecture.
Figure 20. Schematic diagram of convolution processing.
Figure 21. ReLU function graph.
Figure 22. Max pooling diagram.
Figure 23. Average pooling diagram.
Figure 24. Changes in loss values for training and validation sets.
Figure 25. Changes in accuracy for training and validation sets.
Figure 26. Comparison of U-Net prediction results.
Figure 27. Mask R-CNN network architecture.
Figure 28. Bilinear interpolation effect.
Figure 29. Changes in loss values.
Figure 30. Comparison of Mask R-CNN prediction results. (The red boxes in subfigure (c) are the identified joints).
Figure 31. Comparison of prediction results.
Table 1. Comparison of ResNet network structures.

Layer name | 50-layer | 101-layer
Conv1 | 7 × 7, 64, stride 2 | 7 × 7, 64, stride 2
Conv2_x | 3 × 3 max pool, stride 2; [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] × 3 | 3 × 3 max pool, stride 2; [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] × 3
Conv3_x | [1 × 1, 128; 3 × 3, 128; 1 × 1, 512] × 4 | [1 × 1, 128; 3 × 3, 128; 1 × 1, 512] × 4
Conv4_x | [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 6 | [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 23
Conv5_x | [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 3 | [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 3
Table 2. Comparison of joint segmentation effects.

Method | Dice (%) | Precision (%) | Recall (%)
Traditional image processing | 60.59 | 57.31 | 70.03
U-Net | 75.36 | 70.85 | 85.58
Mask R-CNN | 87.48 | 89.74 | 84.73