1. Introduction
Underground mining is a relatively special scene for autonomous landing applications. On the one hand, there are often obstacles such as gravel and water in mine roads that affect safe driving, and the lack of light in mine roads leads to blurred details of underground mine road images, low contrast, and difficulty in identifying features, and transportation accidents occur from time to time. On the other hand, due to the harsh working environment of the mine, safety accidents and frequent occupational diseases, it is very difficult to recruit mine car drivers, and only some older mine car drivers are still holding on. Although the treatment level of mine car drivers is constantly improving, younger drivers are rarely willing to engage in this industry. Therefore, it is an inevitable trend for mine transportation to develop unmanned and intelligent operation [
1].
With the rapid development of deep learning, road detection technology using deep learning has gradually become a hot research field, and many achievements have been made in the field of road recognition [
2,
3,
4,
5,
6,
7]. In the identification of structured roads, structured roads have clear and standard lane lines, clear road boundaries, and are relatively easy to identify, so road areas can be identified by detecting lane lines. Many researchers have carried out a lot of work using semantic segmentation to identify lane lines [
8,
9,
10,
11,
12,
13]. For unstructured roads, such as rural roads, underground mine roads, etc., although there is no standardized lane line, many researchers have offered their solutions. Zhang et al. established a semantic segmentation network model based on unstructured road features [
14] and the segmentation accuracy was significantly improved, but the real-time performance was poor and it could not be truly applied to actual engineering projects. Aiming at the problems of poorly defined road edges and the distant road segmentation effect, Gong et al. improved the network feature extraction capability by integrating the attention mechanism, and adopted a lightweight feature extraction network to greatly reduce the amount of model parameters under the premise of ensuring the recognition accuracy [
15]. However, its accuracy and robustness are still poor.
The mining area is a relatively special scene for the application of autonomous driving. The poor conditions of the mine are mainly manifested in poor road conditions, more gravel and more dust in the mining area. Many mines are mined along the mountain; the road construction is difficult and the service life is short. Therefore, the road surface of the mining area is mostly paved with gravel, and the road conditions are complex, which is more likely to damage vehicle parts and shorten vehicle life. In addition, there is much dust in the mining area, and the dust particles are large. The sensing equipment is easily covered by dust and this affects the environmental perception ability of the unmanned mine car. In addition to the impact of the complex road environment on the unmanned mine car, there are also some problems in the versatility, reliability and safety of the technology. First, in different mining areas, the mine car will fail due to some unique situations; the electromagnetic characteristics of metal ore will affect the performance of millimeter wave radar, and the temperature of alpine mining areas will affect the stability of electrical and electronic components. Second, the environment of the mining area is complex and harsh. Dust, flying stones, and so on pose a great threat to the perception layer of the unmanned mine car. The probability of the perception equipment being hit in a mining area with many rockfalls greatly affects its safety. In view of the challenges of mine roads, some researchers have also carried out extensive work [
16,
17,
18,
19,
20,
21,
22]. Wei Xing et al. improved the pyramid attention module of the original bilateral network and the channel attention fusion module [
23,
24] for the detection of underground coal mine tracks, which improved the detection accuracy. Zhou and others realized the real-time identification of the central area, left safety area and right safety area of the underground track by improving the BiSeNet network [
25]. However, due to the lack of information exchange between parallel branches, it cannot achieve good results in the road identification of underground mine slopes, and it mainly studies the straight roads of mines, and does not consider special roads features such as bends and forks. Shi et al. studied a method for rapid detection of rockfall using laser radar, which can reduce the wear of rockfall on mine car tires and reduce mining costs [
26]. Yuan et al. derived the vehicle position and attitude information using ultra-wideband positioning technology to provide support for unmanned positioning technology in underground coal mines [
27]. Wang et al. improved the RRT * algorithm to solve the path planning problem of underground unmanned vehicles using articulated structures and drift environmental conditions. This provides a new method for underground mine vehicle path planning [
28]. Gin et al. proposes a real-time traffic identification method based on a low-power embedded device, NVIDIA Jetson AGX Xavier [
29]. This low-power device embedded into the unmanned mine car can judge the road conditions of the underground mine road in time and ensure the safe driving of the unmanned mine car.
In summary, there are many problems in the application of driverless technology to mine roads, such as the complex environment of mine roads, the fuzzy and discontinuous road boundaries, the uneven road surfaces, and the flying sand. It can be concluded that the special challenges of underground mine roads are as follows:
Underground mine roads have a certain slope, and misidentification may occur when identifying roads through images.
The boundary of underground mine road is not obvious and the light is not sufficient, which makes it difficult to identify the exercise area of underground mine road.
To address the aforementioned challenging issues in the mine environment, this paper studies an underground mine road recognition method using a camera to collect images for real-time semantic segmentation of underground mine slopes, effectively segmenting the drivable area in the mine slope scene, and realizing real-time and accurate recognition of underground mine slopes. The main contributions of this study are as follows:
Aiming at the characteristics of the underground mine slope road environment, we improved the BiSeNet semantic segmentation network.
A dataset of underground mine roads is constructed, and the dataset is augmented by data augmentation.
The model is verified by experiments and compared with other classical models.
2. Materials and Methods
The detection process of the drivable area of the underground mine road image is shown in
Figure 1. It mainly includes two stages of training and application. In the training stage, the mine road image is collected by the camera mounted in front of the mine car, and the dataset is expanded by the data enhancement method, and the driving area of the underground mine road is marked. The underground mine road image dataset is put into the semantic segmentation network for training. When the loss value converges, the model is trained. In the application stage, the driving area of the underground mine road can be detected by inputting the image of the underground mine road into the trained model.
BiSeNet is used as the basic network model for semantic segmentation of underground mine slopes. The BiSeNet network designs two branch networks, namely Spatial Path and Context Path (CP). Aiming at the problem of spatial information loss in semantic segmentation, the spatial path solves the problem of spatial information loss by acquiring more low-level features, thus encoding rich spatial information, while the context path provides a larger receptive field.
The main structure of the BiSeNet network is shown in
Figure 2. There are 3 layers in the spatial path. The size of the convolution kernel of the first layer is 7 × 7, and the size of the convolution kernel of the last two layers is the same as 3 × 3. Each layer has to undergo convolution with a stride of 2. Using convolution (CONV), batch normalization (BN) and a ReLU nonlinear activation operation, the image size output by this path is 1/8 of the original image, and the context path provides enough experience and semantic information to the network. The attention refinement module (ARM), based on the global average pooling to obtain the global context, guides feature learning by calculating the attention vector, and refines the output features of the context path. Obtaining global contextual information through an attention optimization module can improve image segmentation accuracy. The model designs a feature fusion module (FFM) to fuse spatial features and contextual information to improves the feature extraction capability. The fusion layer uses bilinear interpolation to upsample the feature map size extracted by the context path to the same size as the feature map obtained by the spatial path, and then these features are fused through the feature fusion layer. This module pools connected features into feature vectors and computes weight vectors, which are used to re-weight the features to output features that combine spatial and contextual paths.
The underground mine road environment is complex, and has problems such as uneven roads, obstacles, and insufficient illumination. The mainstream deep learning model, although high in accuracy, cannot run complex deep learning network models due to the limited computing power of the mine car’s on-board computer. Therefore, in the face of the complex mine road environment, it is difficult to meet the real-time demand. BiSeNet’s semantic segmentation algorithm uses a lightweight backbone specially designed for classification tasks, and cannot give full play to the performance of the BiSeNet algorithm in the field of semantic segmentation. Due to the complex environment of underground mines and poor light conditions, BiSeNet using the classification backbone network cannot accurately identify roads. For the situation of underground mine roads, the network needs to be improved. STDC is a real-time semantic segmentation network proposed by Meituan, which gradually reduces the dimension of feature maps and uses aggregated information to represent images [
30]. On BiSeNet, this study replaced the backbone network with STDC. By changing the backbone network, the real-time performance of the network model is improved, and can also achieve better real-time performance within the complex and changeable environment of the underground mine road. The improved BiSeNet mainly consists of a spatial path and a contextual path. First of all, the lightweight network short-term dense concatenate is used as the backbone network of the encoder in the space path, and the original three sets of convolutional layers are replaced. The replaced backbone is composed of STDC modules. The number of deep feature channels is small, and the number of shallow feature channels is large. The STDC module uses skip connections to enrich features, so that the lower layers have enough channels to encode more detailed information with smaller receptive fields, and the higher layers with larger receptive fields strengthen the expression of high-level information. The backbone network composed of STDC modules can reduce the calculation amount of the model network and improve the calculation speed. A unified attention fusion module (UAFM) is added to the context path, which enriches the feature representation. The UAFM improves the detection level of underground mine slopes through channel and spatial attention. The semantic segmentation network model of the improved BiSeNet is shown in
Figure 3; the red box of the space path is the lightweight backbone network STDC, and the red box of the context path is the UAFM.
2.1. STDC Backbone Network
2.1.1. STDC Module
Figure 4 shows the structure of the STDC module. Each module is divided into several blocks, and ConvX
i is used to represent the operation of the i block. The output of block i can be calculated by the following formula:
The input of i module is xi−1, and the output is xi. ConvX consists of three parts: convolution layer, normalization layer and ReLU activation layer. Ki is the size of the convolution kernel. The kernel size of the first ConvX of this module is 1 × 1, and the remaining ConvX kernels are 3 × 3. Assuming that N is the number of channels output from the STDC module, the number of filters in the ith convolutional layer can be calculated from N/2i.
2.1.2. STDC Module Receptive Field
Scalable receptive field and multi-scale information is the focus of image semantic segmentation. The lower layers need enough channels to encode more detailed information with smaller receptive fields, while the higher layers with larger receptive fields pay more attention to the expression of high-level information, and setting the same channel as a lower layer may lead to information redundancy. The x
1 to x
n features are spliced together as the output through the skip path, thereby enriching the feature information. Before cascading, the average pooling of 3 × 3 is used to downsample the response maps of different blocks in the STDC module to the same size. The output results are as follows:
where X
output represents the STDC module output, F represents the fusion operation, and X
1, X
2, X
n are the feature maps of all n blocks.
2.1.3. STDC Network
STDCMNet (short term dense concatenate network) is a lightweight semantic segmentation network proposed by Meituan, which is an upgrade and improvement based on BiSeNet. STDCNet has two main contributions. On the one hand, it improves the backbone of the backbone network and changes the structure of the dense concatenate module. In the same STDC module, the number of channels output by each ConvX gradually decreases with the increase in the receptive field, and finally concatenate together, so it contains more feature scale information. On the other hand, it is a multi-branch low-level detail information assisted training structure. The detail information guidance structure is only used during training. After the network training is completed, it can be directly discarded. This method can reduce the amount of calculation during reasoning compared to the previous BiSeNnet. The network structure organized from the perspective of network output is shown in
Figure 5.
As shown in
Figure 6, excluding the input and output layers, the STDC network consists of 6 stages. Stages 1 to 5 downsample the input spatial resolution with a step size of 2, and the final feature map size is 1/32 of the input image. Stage 6 outputs the prediction logic using ConvX, a global average pooling layer, and two fully connected layers. Stages 1 and 2 use only one convolutional block to extract features, thus increasing efficiency.
2.2. Unified Attention Fusion Module (UAFM)
Underground mine roads are the most irregular of unstructured roads, and are affected by light, water, potholes, etc., making it difficult to achieve accurate identification of underground mine roads. Aiming at the problem of low recognition accuracy of underground mine roads, this study uses a unified attention fusion module to enrich the fusion feature representation through spatial and channel attention fusion, thereby improving the recognition accuracy of the model. As shown in
Figure 7, the UAFM utilizes an attention module to generate the weight α, and fuses the input features with α through Mul and Add operations. The input features are denoted as F
high and F
low. F
high is the output of the deep module, and F
low is the corresponding module of the encoder with the same number of channels.
The UAFM uses a bilinear interpolation operation to upsample F
high to the same size as F
low, and the upsampled feature is denoted as F
up. Then, the attention module takes F
up and F
low as input and generates the weight α. After obtaining the attention weights, Mul operations are performed on F
up and F
low, respectively. Finally, the UAFM performs the Add operation on the attention-weighted features and outputs the fusion features, calculated as follows:
The spatial attention module, shown in
Figure 8, exploits the inter-spatial relationship to generate a weight that represents the importance of each pixel in the input feature.
Given input features, F
up ∈ R
C×H×W sum, 4 features are first generated by averaging and maximizing operations along the channel axis, where the dimension is R
1×H×W. Then, these 4 features are concatenated onto one feature. For the concatenated features, convolution and sigmoid operations are applied to the output α ∈ R
1×H×W. The calculation formula of the spatial attention module is as follows:
Figure 9 shows the channel attention module. The key to the channel attention module is to use the relationship between channels to generate a weight. For this weight, each channel of the input feature is important. The proposed channel attention module uses the average and max pooling to compress the spatial dimension of the input features.
This process generates 4 features using dimension R
C×1×1. These 4 features are then concatenated along the channel axis, and convolution and sigmoid operations are performed to produce a weight α ∈ R
C×1×1. The formula is expressed as follows:
4. Discussion
In addition to comparing the improved network with the network, this paper also selects the SegNet network with high segmentation accuracy and fast detection speed, and the small size of the ENet network is compared and verified with the network in this paper. The comparison results are shown in
Table 2.
SegNet is a classic semantic segmentation network. In this experiment, mIoU reached 86.55%, and the accuracy rate is the second highest, reaching 97.81%. Its achievement lies in its encoder–decoder structure. The way the decoder performs upsampling directly uses the pooling index when performing the maximum pooling in the corresponding encoder stage to perform non-linear upsampling. However, the frames per second of SegNet is only 2 f/s, which is the slowest among the four networks. The mIoU of ENet reached 85.34%, and the pixel accuracy reached 97.53%, which is similar to that of SegNet. Because the ENet network is relatively light, its operating speed reached 13 f/s, which is much faster than that of SegNet, but it is still not sufficient for real-time purposes. The mIoU of BiSeNet is the lowest, only 74.83%, the Dice is also the lowest, only 83.22%, and the frame rate is only 7 f/s. Compared with SegNet and ENet, BiSeNet has almost no advantages.
It can be seen from the data in the table that the improved BiSeNet in this paper has obvious advantages in the mean intersection over union and pixel accuracy. The improved BiSeNet, due to the replacement of the backbone and the addition of a unified attention fusion module, mIoU and fps are 89.34% and 13 f/s, respectively, which have been greatly improved. Compared with the original BiSeNet, the mean intersection over union and pixel accuracy are improved by 19% and 3%. Compared with the previous network, the segmentation effect is the best, and can meet the real-time requirements. The frame rate of the network in this paper can reach 23 f/s, which basically meets the requirements of real-time recognition. The experimental results show that the improved network segmentation rate has been significantly improved, and compared with other networks, the accuracy is also higher, and it has a better segmentation effect.
In order to illustrate the effectiveness of replacing the backbone network and adding the UAFM module, we compared the performance of the network before and after the replacement on underground mine roads, as well as the performance after adding a unified attention fusion module, and the mIoU improvement after replacing the backbone network with STDC. After adding the UAFM on this basis, the mIoU increased by 2.08%.
Table 3 shows the specific experimental results.
Figure 13 includes the following images: (a) shows the input color image; (b) shows the label; (c) shows the target for road recognition in underground mines; (d)~(g) show the results of different network models.
The continuous identification of underground tunnels can explain the robustness of the model. Therefore, we conducted continuous identification experiments on underground mine tunnels.
Figure 14 is a comparison chart of the mine road recognition results at different time points. It can be seen that the improved network has a better effect on identifying underground mine slope roads.
The external environment also affects the recognition of underground mine roads, and the underground mine road environment is special, and there is wind and sand blowing from time to time, which has a certain impact on the camera. Moreover, there are more straight roads and fewer curves in underground mine roads. Therefore, in order to prove the robustness and advancement of the improved network, this paper takes the bends and wind and sand into account, and verifies it through comparative experiments. The experimental results are shown in
Figure 15.
5. Conclusions
Aiming at the semantic segmentation of underground mine roads, an improved network detection model BiSeNet using deep learning technique is proposed. It integrates the backbone network of the STDC model with the agreed attention fusion module UAFM, thereby enriching the feature representation and improving the learning ability of the network model. The improved BiSeNet mine road detection model is tested on the dataset constructed based on the underground mountain road image, and compared with other methods. The experimental results show that the improved BiSeNet network increases the mIoU and PA of the mine road to 89.34% and 98.34%, respectively, and the frame rate to 23 f/s. It provides a reliable reference for the environmental perception of unmanned mine trucks on underground mine roads.
Although the semantic segmentation network proposed in this paper can achieve better detection of mine road areas in the context of underground mines, this study only recognizes the road areas in the image, and does not take into account the interference factors such as road workers and rockfalls on the mountain, which will be studied and solved in our future work.