1. Introduction
Due to the limitations of equipment and technology, tramp materials such as gangue, bolt, stick, iron sheet, and iron chain are mixed into the raw coal during the mining process [1,2]. These tramp materials can easily block and scratch the transfer equipment, cause equipment failure, and, in particular, lead to accidents [3]. As the most important solid waste generated during coal mining, gangue reduces the calorific value of coal during combustion and causes environmental pollution [4]. Therefore, the automatic and efficient separation of coal and tramp materials is of great significance for ensuring safe production and improving coal mining efficiency [5]. At present, most coal mines rely on manual separation to remove tramp materials from raw coal, as shown in Figure 1. This method depends mainly on manual identification of objects, resulting in a poor working environment, high physical labor intensity, and low productivity, all of which endanger the health of miners and run counter to the intelligent development of mines. Other sorting methods, such as wet sorting, consume large amounts of water and cause water pollution, whereas dry sorting has become a promising direction for sorting technology owing to its non-contact operation and economic efficiency. Among dry methods, computer vision-based mineral separation has become a hot research topic in recent years [6]. Therefore, detecting tramp materials mixed in raw coal from optical images is an economical and effective method of beneficiation [7,8,9].
A tramp material image of raw coal contains rich color, grayscale, texture, shape, and other characteristic information. However, due to surface pollution, lighting changes, size differences, and the variety of tramp material shapes, extracting robust tramp material image features is a challenging task. Detection methods based on traditional image processing construct features manually and use support vector machines for classification [10]. These methods place high requirements on the detection environment and depend heavily on manual parameter settings, so there is still much room for improvement in their generalization and stability [11,12,13]. In contrast, object detection technology based on deep learning improves detector performance by learning from training data and adaptively extracting stable image features [14].
Previous research has focused on the detection of gangue, ignoring other types of tramp materials, especially small targets. In this research, we designed an object detection algorithm for multiple tramp materials in raw coal. To ensure real-time detection performance, we used a structure similar to SSD [15]. SSD uses a multi-scale feature map strategy to detect objects of different sizes in real time. However, because there is no semantic interaction between shallow and deep feature maps, small targets cannot be detected effectively; for this reason, SSD is not well suited to detecting multi-scale tramp material targets. To improve detector performance, we first used a densely connected network containing a four-stage feature extractor to down-sample the feature map stably. The feature extraction module adopts a multi-branch dilation convolution structure to make efficient use of feature maps by fusing feature maps with different receptive fields. The feature fusion module uses cross-layer feature fusion and an attention fusion module to fuse the position information of shallow feature maps with the semantic information of deep feature maps, improving the detection of small objects. We refer to the proposed detector as the tramp materials in raw coal single-shot detector (TMRC-SSD). The main contributions of this research can be summarized as follows:
1. A CNN model-based detection algorithm for multiple tramp material objects in raw coal is proposed, namely TMRC-SSD. It extracts multi-scale image features and, by training deep learning models, effectively detects five types of tramp materials: gangue, bolt, stick, iron sheet, and iron chain. Experimental results show that the proposed network achieves 96.12% mAP at 24 FPS.
2. An image dataset of tramp materials in raw coal is constructed. We verified through experiments the effectiveness of the proposed MDCS, CLFF, and AFM modules in improving detection accuracy. The experimental results show that the TMRC-SSD network increases the AP of small object detection by 4.1% to 95.57%.
The organizational structure of this article is as follows. In Section 2, we review previous research on the detection of tramp materials in raw coal. In Section 3, we describe TMRC-SSD in detail. In Section 4, we introduce the dataset and evaluation metrics. In Section 5, we present the experimental and visualization results. Finally, Section 6 summarizes the research results.
3. Proposed Method
3.1. Backbone Network
The shallow feature map retains more detailed information of the image. The traditional SSD network performs feature extraction by continuously shrinking the feature map; this reduces the computational cost but loses part of the image information. Inspired by densely connected networks, we designed a modified DenseNet, named M-DenseNet, as the backbone network of TMRC-SSD, as shown in Table 1.
In the original DenseNet design, a 7 × 7 convolutional layer and a 3 × 3 max pooling layer are placed before the first dense block (Figure 2). In M-DenseNet, we designed a stem block to replace the operations before stage (1). The input image first passes through a 3 × 3 convolution with a stride of 2, then a central asymmetric convolution structure, and finally a filter concatenation and a 1 × 1 convolution. This operation effectively improves the expressive ability of features without increasing the computational cost.
All DenseNet structures contain four dense blocks. A pooling operation is used in each transition layer to down-sample the feature map. As the backbone of TMRC-SSD, M-DenseNet differs from the base DenseNet in that it increases the number of convolution operations in each dense block. To prevent feature mapping from becoming difficult as the network deepens, we adopt a simple transition layer strategy to improve detector performance.
Each layer of a dense block outputs k feature maps after convolution, i.e., the number of channels in the generated feature map is k (the growth rate). Because DenseNet uses concatenation, if each layer generates k feature maps, a large number of feature maps accumulate; to prevent the network from becoming very wide, k must be limited to a small range. In M-DenseNet, we set k = 32. Moreover, to reduce the width of the network, a 1 × 1 convolution is introduced before each 3 × 3 convolution as a bottleneck layer to limit the computational load. A compression factor θ can be applied in the transition layers to compress the number of channels between blocks; in M-DenseNet, we set θ = 1 (no compression) to preserve the expressive ability of the features.
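The channel growth and compression described above can be checked with a small accounting sketch (pure Python; the 6-layer block and 64 starting channels in the example are illustrative, not taken from Table 1):

```python
def dense_block_channels(in_channels, num_layers, k=32):
    # Each layer of a dense block appends k (the growth rate) feature maps,
    # which are concatenated with all previous outputs.
    return in_channels + num_layers * k

def transition_channels(in_channels, theta=1.0):
    # Transition layer compresses channels by factor theta (theta = 1: none).
    return int(in_channels * theta)

# Illustrative: a 6-layer dense block starting from 64 channels, k = 32.
out_ch = dense_block_channels(64, num_layers=6, k=32)   # 64 + 6 * 32 = 256
kept = transition_channels(out_ch, theta=1.0)           # theta = 1 keeps all
```

With θ = 1, the transition layer passes all 256 channels through; a DenseNet-BC-style θ = 0.5 would halve them.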
3.2. Feature Extraction Module
Expanding the receptive field of the detector is an effective means of improving its performance. Previous researchers extracted multi-scale information by using convolution kernels of multiple sizes; as the kernels grew, the detection effect improved, but the computational cost increased rapidly.
Dilation convolution is a variant of traditional image convolution. By inserting spaces (dilation) into a traditional convolution kernel, the receptive field is expanded. Dilation convolutions with different dilation rates are often used to fuse multi-scale feature map information and receptive fields. The receptive field is calculated as follows:

r_k = r_{k−1} + (f_k − 1) · ∏_{i=1}^{k−1} s_i

where r_k represents the size of the receptive field of layer k; r_{k−1} represents the receptive field of layer k − 1; f_k represents the kernel size of layer k; and s_i represents the stride of layer i, with the product taken over all layers before layer k.

The dilation convolution expands the receptive field without reducing the size of the feature map. The equivalent kernel size f′_k of a dilation convolution in layer k is as follows:

f′_k = f_k + (f_k − 1)(d − 1)

where f_k represents the kernel size of the original ordinary convolution in layer k, and d represents the dilation rate. Combining the two formulas above, relative to an ordinary f_k × f_k kernel, a dilated kernel of equivalent size f′_k expands the receptive field of layer k as follows:

r′_k = r_{k−1} + (f′_k − 1) · ∏_{i=1}^{k−1} s_i = r_k + (f_k − 1)(d − 1) · ∏_{i=1}^{k−1} s_i
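The receptive-field and dilated-kernel relations above can be sketched as a short calculation (pure Python; the layer configurations in the assertions are illustrative):

```python
def dilated_kernel_size(f, d):
    # Equivalent kernel size of an f x f kernel with dilation rate d.
    return f + (f - 1) * (d - 1)

def receptive_field(layers):
    # layers: sequence of (kernel_size, stride, dilation) from input to output.
    r, jump = 1, 1          # receptive field and cumulative stride product
    for f, s, d in layers:
        f_eff = dilated_kernel_size(f, d)
        r += (f_eff - 1) * jump
        jump *= s
    return r

# A 3x3 kernel with d = 2 behaves like a 5x5 kernel for receptive-field purposes.
assert dilated_kernel_size(3, 2) == 5
```

For example, stacking a stride-2 3 × 3 layer and a stride-1 3 × 3 layer gives a receptive field of 7, matching the recursive formula.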
The receptive field relationship is illustrated in Figure 3. Take a 5 × 5 ordinary convolution kernel applied to a 5 × 5 image matrix as an example: after convolution, a 1 × 1 feature map is obtained, i.e., the single pixel of the feature map has a receptive field covering the entire original image. Now consider a dilation convolution kernel of size 3 × 3 with d = 2, as shown in Figure 3b. Applying this dilation convolution to the 5 × 5 image matrix also yields a 1 × 1 feature map. The feature map obtained by the dilation convolution therefore has the same receptive field as one produced by an ordinary 5 × 5 convolution; the dilation convolution achieves the larger receptive field with fewer parameters.
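This equivalence can be reproduced numerically. The sketch below (pure Python, no framework assumed) runs a valid-mode convolution with an optional dilation rate; both a 5 × 5 ordinary kernel (25 weights) and a 3 × 3 kernel with d = 2 (9 weights) reduce a 5 × 5 input to a 1 × 1 output:

```python
def conv2d_valid(img, kernel, dilation=1):
    # Valid-mode 2-D convolution; dilation samples the input with gaps.
    kh, kw = len(kernel), len(kernel[0])
    eff_h = kh + (kh - 1) * (dilation - 1)   # equivalent kernel height
    eff_w = kw + (kw - 1) * (dilation - 1)
    oh = len(img) - eff_h + 1
    ow = len(img[0]) - eff_w + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            for u in range(kh):
                for v in range(kw):
                    out[i][j] += kernel[u][v] * img[i + u * dilation][j + v * dilation]
    return out

img = [[1.0] * 5 for _ in range(5)]
full = conv2d_valid(img, [[1.0] * 5 for _ in range(5)])        # 25 weights
dilated = conv2d_valid(img, [[1.0] * 3 for _ in range(3)], 2)  # 9 weights
```

Both outputs are 1 × 1, confirming that the dilated 3 × 3 kernel spans the same 5 × 5 region of the input.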
Because of its preset, fixed kernel size, a standard convolution can only extract information from one receptive field in the image. To obtain richer image information, we designed a multi-branch dilation convolution structure (MDCS), as shown in Figure 4. First, a 1 × 1 convolution achieves channel interaction and reduces the number of channels in each branch to a quarter of the input. To limit the parameter size, a 3 × 3 convolution is added on each of the middle three branches. Next, dilation convolutions with d = 1, 2, 3, and 5 are used on the branches to enrich the receptive fields, and the branch outputs are joined by a concatenation operation. To avoid unstable gradient transfer, a shortcut connection is integrated into the module. MDCS introduces dilation convolution and adjusts the dilation rate to obtain feature maps with different receptive fields without increasing the number of parameters. By fusing receptive fields of different scales and making full use of the contextual information of the feature map, the feature representation ability of the model is enhanced.
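A shape-level sketch of one MDCS block may clarify the channel bookkeeping (pure Python; the structure follows the description above, while the 256-channel input in the example is illustrative):

```python
def mdcs_branch_plan(in_channels):
    # Four branches, each reduced to a quarter of the input channels by a
    # 1x1 convolution, then dilated 3x3 convolutions with d = 1, 2, 3, 5.
    branch_ch = in_channels // 4
    plan = []
    for d in (1, 2, 3, 5):
        eff = 3 + (3 - 1) * (d - 1)   # equivalent dilated kernel size
        plan.append({"channels": branch_ch, "dilation": d, "eff_kernel": eff})
    # Concatenation restores the input width; the shortcut then adds the input.
    out_channels = branch_ch * len(plan)
    return plan, out_channels

plan, out_ch = mdcs_branch_plan(256)
```

The four branches see equivalent kernel sizes of 3, 5, 7, and 11, i.e., four receptive-field scales at the parameter cost of 3 × 3 kernels, and the concatenated output matches the input width so the shortcut can be added directly.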
3.3. Feature Fusion Module (FFM)
3.3.1. Cross-Layer Feature Fusion (CLFF)
In the original SSD design, convolution kernels of selected sizes obtain local receptive fields in the image, and a series of down-sampling operations continuously shrinks the feature map; in this process, the semantic information of small target objects is gradually lost. SSD adopts the strategy of detecting small targets with shallow feature maps and large targets with deep feature maps, but this approach has been shown to be deficient for small targets. Inspired by the idea of feature fusion, we designed a cross-layer feature fusion (CLFF) module that fuses the position information of shallow feature maps with the high-level semantic information of deep feature maps to improve the model's ability to detect small targets. In CLFF-1, a feature layer of size 38 × 38 × 512 is formed by fusing three feature maps (stage1, stage2, and conv7) extracted from the network, as shown in Figure 5. The fusion operation is as follows: first, stage1 is down-sampled and conv7 is up-sampled; once the sizes match, stage1, stage2, and conv7 each pass through a 3 × 3 convolution layer. To avoid the influence of gradient fluctuation, batch normalization is applied to the fused feature maps; we abbreviate batchnorm, scale, and ReLU as B-S-R. The other CLFF modules follow a similar fusion process.
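The size alignment in CLFF-1 can be sketched as follows (pure Python; the 75/38/19 source sizes assume a 300 × 300 input and are an assumption, since only the fused 38 × 38 × 512 target is specified above):

```python
import math

def clff_align(sizes, target):
    # Down-sample larger maps by stride 2 (ceil), up-sample smaller maps by 2x.
    aligned = {}
    for name, size in sizes.items():
        if size > target:
            aligned[name] = math.ceil(size / 2)   # stride-2 down-sampling
        elif size < target:
            aligned[name] = size * 2              # x2 up-sampling (deconvolution)
        else:
            aligned[name] = size
    return aligned

# Hypothetical spatial sizes for the three fused maps in CLFF-1.
aligned = clff_align({"stage1": 75, "stage2": 38, "conv7": 19}, target=38)
```

After alignment, all three maps share the 38 × 38 spatial size and can pass through their 3 × 3 convolutions and be fused.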
3.3.2. Attention Fusion Module (AFM)
Continuous down-sampling and convolution of the original image produce feature layers with different semantic content and resolutions. Selecting suitable feature layers is very important for detecting multi-scale targets, and the detection of small targets is a particularly difficult problem; for the basic SSD network, small target detection is poor owing to the lack of interaction between feature maps. This paper designs an attention fusion module (AFM), embedded in the feature fusion module, to fuse high-level and low-level semantic information and construct multi-scale feature maps containing the objects to be detected, thereby improving the detection accuracy for small targets. The attention fusion block achieves efficient feature extraction by assigning different weights to different channels and positions on the feature map. To ensure that adjacent feature maps have matching dimensions, a deconvolution operation is used. Then, to make the network assign higher weights to regions of interest, SENet is integrated into the AFM, as shown in Figure 6.
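A minimal sketch of the SENet squeeze-and-excitation step used inside the AFM (pure Python with hand-picked toy weights, purely for illustration; in the real module the weights are learned):

```python
import math

def squeeze_excite(feature, w1, b1, w2, b2):
    # feature: C channel maps, each an H x W list of lists.
    # Squeeze: global average pooling per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one weight per channel.
    h = [max(0.0, sum(w * zj for w, zj in zip(row, z)) + b)
         for row, b in zip(w1, b1)]
    s = [1.0 / (1.0 + math.exp(-(sum(w * hj for w, hj in zip(row, h)) + b)))
         for row, b in zip(w2, b2)]
    # Scale: reweight each channel map by its attention weight.
    out = [[[v * s[c] for v in row] for row in ch]
           for c, ch in enumerate(feature)]
    return out, s

feature = [[[1.0, 1.0], [1.0, 1.0]],   # channel 0
           [[2.0, 2.0], [2.0, 2.0]]]   # channel 1
out, weights = squeeze_excite(feature,
                              w1=[[0.5, 0.5]], b1=[0.0],          # reduce C=2 -> 1
                              w2=[[1.0], [-1.0]], b2=[0.0, 0.0])  # expand 1 -> 2
```

Each channel is rescaled by a weight in (0, 1); channels the excitation scores highly keep more of their activation, which is how the AFM emphasizes regions of interest.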
3.3.3. Residual Block
The traditional SSD achieves classification and regression by using convolution filters placed in front of each detection head. We designed a residual block and added it before each prediction layer, as shown in Figure 7. We used standard convolution operations to obtain different receptive fields and added residual connections to unify the features, which effectively restrains the growth of parameters while increasing the depth and width of the network. The residual block lets the gradient of the loss function flow directly through the shortcut during back-propagation, which reduces the computational cost while effectively improving detection accuracy and overall network performance.
3.4. Construction of TMRC-SSD
The purpose of this research is to improve the accuracy of tramp material detection in raw coal. We propose an improved SSD detection framework named the tramp materials in raw coal single-shot detector (TMRC-SSD), as shown in Figure 8. It mainly comprises the backbone network, a feature extraction module, a feature fusion module, detection heads, and non-maximum suppression. The original image is resized to a preset size and fed into the network. After the stem block, a four-stage feature extractor in the backbone down-samples the feature map stably. The multi-branch dilation convolution structure is integrated into the conv7~conv11 modules to obtain rich receptive fields and extract robust feature map information while avoiding the sharp increase in computing cost that receptive field expansion would otherwise cause. To realize information interaction between feature maps, four cross-layer feature fusion modules are integrated into the feature fusion module, and the attention fusion block fuses adjacent CLFF feature layers. SENet improves detection accuracy by increasing the weight of regions of interest and, at the same time, enhances the network's ability to detect small targets. Before each prediction layer, a residual block is used to control the gradient flow of the loss function and improve the performance of the network.
3.5. Loss Function
The loss function of the network mainly consists of two parts, the confidence loss (L_conf) and the localization loss (L_loc):

L(x, c, l, g) = (1/N) (L_conf(x, c) + α L_loc(x, l, g))

where the weight term α is set to 1 by cross-validation and N represents the number of matched (positive) default boxes. The localization loss is the smooth L1 loss between the predicted box (l) and the ground-truth box (g); the parameters (cx, cy), (w), and (h) represent the center, width, and height offsets of the default bounding box (d), respectively:

L_loc(x, l, g) = Σ_{i∈Pos} Σ_{m∈{cx,cy,w,h}} x_{ij}^{k} smooth_L1(l_i^m − ĝ_j^m)

where

ĝ_j^{cx} = (g_j^{cx} − d_i^{cx}) / d_i^{w},  ĝ_j^{cy} = (g_j^{cy} − d_i^{cy}) / d_i^{h},
ĝ_j^{w} = log(g_j^{w} / d_i^{w}),  ĝ_j^{h} = log(g_j^{h} / d_i^{h}).

The confidence loss is the softmax loss over the multi-class confidences (c), given by:

L_conf(x, c) = − Σ_{i∈Pos} x_{ij}^{p} log(ĉ_i^{p}) − Σ_{i∈Neg} log(ĉ_i^{0})

where

ĉ_i^{p} = exp(c_i^{p}) / Σ_{p} exp(c_i^{p}).
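The smooth L1 and softmax terms of the loss can be sketched directly (pure Python, scalar forms only, for illustration):

```python
import math

def smooth_l1(x):
    # smooth_L1(x) = 0.5 x^2 if |x| < 1, else |x| - 0.5
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def softmax(scores):
    # Numerically stable softmax over the class confidences of one box.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def confidence_loss(scores, true_class):
    # -log of the softmax probability of the true class for one box.
    return -math.log(softmax(scores)[true_class])
```

Smooth L1 is quadratic near zero and linear beyond |x| = 1, so large localization errors do not dominate the gradient; the confidence term is the usual cross-entropy over softmax probabilities.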
6. Conclusions
Aiming at the problem of detecting multiple tramp materials in raw coal, this paper proposed a new object detection framework based on feature fusion and dense networks, named TMRC-SSD, to detect five types of tramp materials: gangue, bolt, stick, iron sheet, and iron chain. To improve detection accuracy in complex environments, especially for small targets, we designed three modules: MDCS, CLFF, and AFM. First, we designed a modified DenseNet as the backbone of the detector, using a four-stage feature extractor to down-sample the feature map stably. MDCS uses a multi-branch structure and dilation convolution to obtain abundant receptive fields while reducing computational cost. In the feature fusion module, four CLFFs fuse the shallow and deep feature maps to achieve semantic interaction between different feature maps, and AFM fuses adjacent CLFF feature layers. The ablation experiments prove the effectiveness of the proposed modules. In addition, we constructed a dataset containing the five tramp material categories to evaluate the performance of the detector. Experimental results show that the proposed TMRC-SSD network achieves 96.12% mAP at 24 FPS, the best result among the compared existing methods, and increases the AP of small objects by 4.1% to 95.57%.
In future work, we will further improve the detection speed and accuracy of our network for tramp materials by expanding the dataset and refining the network.