1. Introduction
Coal is one of the most significant energy sources in China, and the country's coal output was projected to reach 4.71 × 10⁹ tons by the end of 2023 [1], accounting for approximately 66.6% of total primary energy production for that year. More than 10% of the material generated during coal mining and preparation is gangue waste [2]. The combustion of gangue mixed into coal not only diminishes the actual calorific value but also releases substantial amounts of toxic gases, resulting in severe environmental pollution. Consequently, the accurate, rapid, and effective separation of coal and gangue is essential for enhancing coal quality and promoting sustainable development in coal mining operations.
The conventional methods for separating coal from gangue primarily include manual sorting, selective crushing, heavy medium separation, wind separation, and jigging. Traditional manual sorting is characterized by high labor intensity and low efficiency, while the heavy medium and selective crushing techniques often face challenges such as high equipment costs, significant environmental pollution, and operational complexity [3,4,5]. With advancements in artificial intelligence technology, intelligent coal preparation has increasingly become a primary focus of research in this field. Common methods for identifying coal and gangue include ray recognition, infrared thermal imaging, visual recognition, and multispectral recognition, among others [6].
Traditional image recognition requires the manual configuration and extraction of image features. Chen Li et al. [7] utilized wavelet techniques to denoise coal gangue images and developed wavelet moments for feature extraction, then differentiated between coal and gangue based on the extracted feature values, thereby improving the efficiency of coal–gangue separation. Shen Ning et al. [8] utilized support vector machines and the Relief algorithm to extract 28 features from images of coal and gangue, then screened these features to develop an optimal classifier, improving the accuracy of identifying raw coal with diverse appearances. Cheng Gang et al. [9] capitalized on the differences in heat absorption capacity between coal and gangue by capturing their images with an infrared thermal imager and classifying them with support vector machines, demonstrating the feasibility of infrared thermal imaging for coal and gangue recognition. Yang Huigang et al. [10] were the first to determine the thickness of coal and gangue by scanning images captured with a CCD camera, then employed X-ray imaging for identification, reducing the impact of thickness on the visual representation of coal and gangue. Li Hequn et al. [11] proposed an Otsu threshold segmentation method based on the Gaussian pyramid to obtain images of coal and gangue, employed a gray area size matrix and a gray level co-occurrence matrix for feature extraction, and used a support vector machine for classification and recognition to mitigate the impact of lighting variations. Yu Le et al. [12] distinguished between coal and gangue by partially compressing the gray levels of the images while analyzing four characteristic parameters derived from the gray level co-occurrence matrix, offering a novel technical method for the separation of coal and gangue.
The image recognition method based on deep learning has gained widespread application in the field of coal and gangue identification due to its high accuracy, rapid detection speed, and robust performance. Shan Pengfei et al. [13] proposed an algorithm that employs an enhanced Faster R-CNN framework to tackle the problem of excessive dust generated during fully mechanized top coal caving mining, and developed a test bench to validate the identification and localization of coal gangue at the moment of caving. Du Jingyi et al. [14] introduced an enhanced SSD algorithm that incorporates GhostNet as a replacement for the SSD backbone network, resulting in a lightweight network structure that effectively detects small coal and gangue targets. Wang Deyong et al. [15] proposed an enhanced algorithm based on YOLOv5s, which achieved efficient detection in complex environments characterized by high noise levels and low illumination. Guo Yongcun et al. [16] introduced a method to optimize the convolutional neural network (CNN) algorithm by implementing weight migration and simplifying the neuron model, enhancing small target detection while preserving spatial information. Teng Wenxiang et al. [17] developed a coal gangue recognition algorithm based on the HGTC-YOLOv8n model, which significantly improved the recognition of overlapping and occluded targets.
Guanghui Xue et al. [18] proposed a lightweight YOLO coal gangue detection algorithm based on ResNet18, which significantly enhanced both the real-time performance and accuracy of the model's detection by substituting the backbone network and implementing feature-scale clipping in conjunction with unstructured pruning. Hongguang Pan et al. [19] proposed an enhanced YOLOv3-tiny algorithm for coal gangue detection that incorporates a spatial pyramid pooling (SPP) network, the squeeze-and-excitation (SE) module, and dilated convolution to achieve rapid and efficient sorting while reducing computational complexity. Deyong Shang et al. [20] proposed a lightweight coal gangue recognition algorithm that enhances YOLOv5s by incorporating the SimAM attention mechanism and the GhostNet backbone network, aiming to balance model efficiency, computational load, and parameter count, particularly under low-light conditions.
With the advancement of intelligent construction in coal mining, there has been a growing emphasis on the intelligent sorting of coal gangue. Although significant progress has been made in current detection methods, they still have limitations in balancing real-time processing, detection accuracy, model deployment, and small-target detection. The YOLOv3 algorithm achieves high precision but falls short in real-time performance and detection efficiency; YOLOv5 has made significant progress in real-time performance, but its robustness remains insufficient for detecting targets against complex backgrounds. As a newly released algorithm, YOLOv11 does not yet fully meet the requirements of coal and gangue sorting tasks, particularly in challenging environments.
In response to the complex challenges posed by mixed coal waste and motion blur during coal gangue detection, this study introduces the EBD-YOLO algorithm for the identification of coal and gangue, aiming to overcome the limitations of existing methodologies. Specifically, an EMA attention mechanism [21] is integrated into the C3k2 module, yielding the C3k2-EMA module, which replaces all C3k2 modules in the backbone and thereby enhances the model's ability to extract surface features of coal and gangue. The bidirectional feature fusion module BiFPN is integrated into the Neck layer to combine feature information from coal and gangue images more effectively and to improve processing efficiency. Finally, the original detection head is replaced with DyHead [22], which integrates a self-attention mechanism, allowing the model to focus more strongly on critical feature areas after multiple downsampling stages; this improves the representation capability, spatial perception, and feature expression of the detection layer and thus the model's performance on small targets. Through these innovations, the proposed method not only addresses the limitations of existing approaches in adaptability to complex environments and real-time detection, but also enhances accuracy and efficiency, making it better equipped for intelligent sorting tasks in coal mines.
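For illustration, the following PyTorch sketch shows one way the EMA-style attention used in the C3k2-EMA module can be implemented, following the grouped cross-spatial design of the published EMA mechanism [21]; the class name, grouping factor, and its insertion point inside C3k2 are illustrative assumptions rather than the authors' exact code.

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    """Efficient Multi-scale Attention (sketch following the published EMA design)."""
    def __init__(self, channels: int, factor: int = 8):
        super().__init__()
        self.groups = factor                            # channels must be divisible by factor
        c = channels // factor
        self.softmax = nn.Softmax(dim=-1)
        self.agp = nn.AdaptiveAvgPool2d((1, 1))         # global pooling for cross-spatial weights
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # pool along width  -> (h, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # pool along height -> (1, w)
        self.gn = nn.GroupNorm(c, c)
        self.conv1x1 = nn.Conv2d(c, c, kernel_size=1)
        self.conv3x3 = nn.Conv2d(c, c, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.size()
        g = x.reshape(b * self.groups, -1, h, w)        # split channels into groups
        # 1x1 branch: direction-aware context along H and W
        x_h = self.pool_h(g)
        x_w = self.pool_w(g).permute(0, 1, 3, 2)
        hw = self.conv1x1(torch.cat([x_h, x_w], dim=2))
        x_h, x_w = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(g * x_h.sigmoid() * x_w.permute(0, 1, 3, 2).sigmoid())
        # 3x3 branch: local multi-scale context
        x2 = self.conv3x3(g)
        # Cross-spatial learning: each branch re-weights the other's spatial map
        cg = c // self.groups
        a1 = self.softmax(self.agp(x1).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        a2 = self.softmax(self.agp(x2).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        m1 = x2.reshape(b * self.groups, cg, -1)
        m2 = x1.reshape(b * self.groups, cg, -1)
        weights = (a1 @ m1 + a2 @ m2).reshape(b * self.groups, 1, h, w)
        return (g * weights.sigmoid()).reshape(b, c, h, w)

# Hypothetical usage inside a C3k2-style block: attention applied to the block output.
# y = EMA(128)(feature_map)   # feature_map: (B, 128, H, W)
```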
3. Experiment and Analysis
3.1. Experimental Environment Configuration
The hardware used in this experiment comprises a Windows 10 operating system, an Intel(R) Core(TM) i5-12400F CPU @ 2.5 GHz, an RTX 4060 Ti graphics card, and 32 GB of RAM; the software environment is Python 3.10.14 + PyTorch 2.2.2 + CUDA 12.1. The model parameters are set as follows: input image size of 640 × 640, 300 training epochs, batch size of 16, initial learning rate of 0.01, momentum of 0.937, weight decay factor of 0.0005, and an early-stopping patience of 50 epochs.
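As a rough illustration, these hyperparameters map onto the Ultralytics training interface as sketched below; the model and dataset configuration file names are placeholders, since the paper's EBD-YOLO definition is not public.

```python
from ultralytics import YOLO

# Hypothetical config for the modified network; "yolo11n.yaml" would be the stock baseline.
model = YOLO("ebd-yolo.yaml")

model.train(
    data="coal_gangue.yaml",  # assumed dataset config (train/val/test paths, 2 classes)
    imgsz=640,                # input image size 640 x 640
    epochs=300,               # 300 training epochs
    batch=16,                 # batch size
    lr0=0.01,                 # initial learning rate
    momentum=0.937,           # SGD momentum
    weight_decay=0.0005,      # weight decay factor
    patience=50,              # early stopping after 50 epochs without improvement
)
```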
3.2. Experimental Dataset
The dataset used in this study is a self-constructed collection of coal and gangue sourced from a coal preparation plant in Datong City, Shanxi Province. Video footage was captured using a ZHS2580 intrinsically safe explosion-proof camera (Shenzhen, China) at a resolution of 1920 × 1080. PotPlayer software, version 1.7.21953, was then used to periodically extract frames from the video, from which 495 valid images were selected. To address the challenges posed by dust, noise, and occlusion during the coal gangue sorting process, we expanded the dataset to 2000 samples using various data augmentation techniques, including flipping, adaptive histogram equalization, noise injection, and brightness adjustment. This expansion enhances the model's generalization capability, improves robustness, and mitigates the risk of overfitting during training. In the annotation phase, LabelImg was used to label the images as coal or gangue. Finally, the 2000 images were divided into training, validation, and test sets in a ratio of 7:2:1, yielding 1400, 400, and 200 images, respectively. Sample augmented images are shown in Figure 6.
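The augmentation operations described above can be sketched with OpenCV as follows; the exact parameters (noise level, brightness offset, CLAHE settings) are illustrative assumptions, not the values used to build the dataset.

```python
import cv2
import numpy as np

def augment(img: np.ndarray) -> dict:
    """Return several augmented variants of a BGR image (illustrative parameters)."""
    out = {}
    out["flipped"] = cv2.flip(img, 1)  # horizontal flip

    # Adaptive histogram equalization (CLAHE) on the luminance channel
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lab[:, :, 0] = clahe.apply(lab[:, :, 0])
    out["equalized"] = cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)

    # Additive Gaussian noise, simulating dust and sensor noise
    noise = np.random.normal(0, 15, img.shape).astype(np.float32)
    out["noisy"] = np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

    # Brightness shifts, simulating uneven illumination
    out["brighter"] = cv2.convertScaleAbs(img, alpha=1.0, beta=40)
    out["darker"] = cv2.convertScaleAbs(img, alpha=1.0, beta=-40)
    return out
```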
3.3. Model Evaluation Indicators
To verify the performance of the improved model, precision (P), recall (R), and mean average precision (mAP@0.5) were adopted as the main criteria for detection accuracy, while floating-point operations (FLOPS), the parameter count, and frames per second (FPS) were used to measure model complexity and speed. The calculation formulas for precision P and recall R are shown in Formulas (7) and (8):
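$$P = \frac{TP}{TP + FP} \tag{7}$$

$$R = \frac{TP}{TP + FN} \tag{8}$$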
Among these metrics, true positive (TP) denotes the number of instances that the model correctly identifies; false positive (FP) indicates the number of incorrect identifications made by the model; and false negative (FN) represents the number of actual positive samples that the model misclassifies as negative. The formula for calculating mean average precision (mAP) can be derived from precision (P) and recall (R), as illustrated in Equation (9):
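$$mAP = \frac{1}{n}\sum_{i=1}^{n} AP_i \tag{9}$$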
Among these metrics, AP denotes the recognition accuracy for a single category, while n represents the number of target categories. The mean average precision (mAP) is generally calculated at an intersection over union (IoU) threshold of 0.5 and reflects the average detection accuracy across all categories; a higher mAP value indicates superior model performance.
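For reference, a common way to compute a single-class AP from a precision–recall curve (all-points interpolation) and then average over classes is sketched below; this is a generic implementation, not the authors' evaluation code.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """AP as the area under the precision-recall curve (all-points interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    # Monotone non-increasing precision envelope
    p = np.maximum.accumulate(p[::-1])[::-1]
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class) -> float:
    """mAP@0.5: mean of per-class APs with matches counted at IoU >= 0.5."""
    return float(np.mean(ap_per_class))
```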
The FLOPS of a model quantifies the number of floating-point operations required for a single forward pass and thus reflects its computational complexity. The parameter count serves as a metric for the model's size: a model with a high parameter count consumes more storage and computational resources, while a model with fewer parameters is generally more lightweight.
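Parameter counts can be read directly off a PyTorch model, and FLOPS can be estimated with a profiler such as thop; the snippet below is a generic sketch assuming a thop-style interface, not the tooling used in the paper.

```python
import torch
from thop import profile  # third-party profiler; `pip install thop`

def model_cost(model: torch.nn.Module, imgsz: int = 640):
    """Return (GFLOPs, params in millions) for one 640 x 640 forward pass."""
    params = sum(p.numel() for p in model.parameters())
    dummy = torch.zeros(1, 3, imgsz, imgsz)
    flops, _ = profile(model, inputs=(dummy,), verbose=False)
    return flops / 1e9, params / 1e6
```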
3.4. Ablation Experiment
To evaluate the optimization effects of each module presented in this article, ablation experiments were conducted using the YOLOv11n benchmark model. The results of these experiments are summarized in Table 1.
Table 1 shows that model ① replaces the C3k2 component in the original model with the C3k2-EMA module, resulting in a 0.7% increase in mAP@0.5. This suggests that C3k2-EMA effectively addresses the information loss encountered during feature extraction in the original C3k2 module.
By incorporating an attention mechanism into the C3k2 module, the model's ability to extract multi-scale features is significantly enhanced. Model ② introduces BiFPN independently within the original framework, resulting in a 1.5% increase in P and a 1.2% improvement in mAP@0.5; furthermore, this modification achieves the greatest reduction in model parameters among all modules.
This suggests that the BiFPN module can achieve both a lightweight design and enhanced performance while also improving the detection of similar features between coal and gangue. Model ③ incorporates the DyHead module independently, resulting in a 0.3% improvement in mAP@0.5 and a 4.3% increase in R, albeit with a slight rise in the number of parameters and computational complexity. This indicates that the DyHead module significantly enhances the model’s detection capabilities for coal and gangue by integrating three attention mechanisms into a unified framework.
In Model ④, the C3k2-EMA module and the BiFPN module are integrated into the original architecture. This modification increases mAP@0.5 by 1.5%, with R reaching 83.4%. Although the number of parameters is slightly reduced, FPS decreases by 0.2%. The findings indicate that the C3k2-EMA and BiFPN modules mutually reinforce one another: together they significantly improve detection by augmenting feature extraction and facilitating multi-scale feature fusion, at the expense of some FPS.
Model ⑤ integrates both the C3k2-EMA module and the DyHead module into the original framework. This model increases mAP@0.5 by 3.3%, with P and R improving by 3% and 4.4%, respectively, and FPS enhanced by 5.3%. The results indicate that integrating the C3k2-EMA module with DyHead significantly enhances detection performance. Although the model's parameter count increases by 21.7%, this increment is still lower than that observed in other models.
Model ⑥ incorporates the BiFPN module, building upon model ③. This integration yields a 1.5% increase in mAP@0.5, improvements of 2.3% in both precision and recall, and an 8.2% enhancement in FPS. This demonstrates that model ⑥ effectively combines the bidirectional feature fusion capabilities of BiFPN with the computational efficiency of DyHead, improving both detection accuracy and FPS.
In summary, the enhanced model ⑦ exhibits the highest overall performance. It accomplishes this by maintaining high efficacy while only marginally increasing the number of parameters. Additionally, it provides superior capabilities in real-time detection, effectively addressing the practical requirements for identifying coal and gangue in complex backgrounds.
3.5. Contrast Experiment
3.5.1. Comparative Analysis of Different Models
To validate the advantages of the enhanced model proposed in this study for coal and gangue detection, experiments were conducted under identical conditions against current mainstream object detection algorithms, including YOLOv3-tiny, YOLOv5s, YOLOv8n, YOLOv10n, and YOLOv11s. The results are presented in Table 2. Given the constraints of the hardware detection platform, it is essential to minimize both the parameter count and computational complexity of the selected model while maintaining high detection accuracy.
According to Table 2, the improved model achieves the highest P, R, and mAP@0.5, at 88.7%, 83.9%, and 91.7%, respectively. Compared with the baseline YOLOv11n, P, R, and mAP@0.5 are 3.4%, 3.7%, and 3.9% higher, respectively; although the computational and parameter requirements increase, FPS improves by 10.01%. Compared with YOLOv11s, P, R, and mAP@0.5 increase by 0.6%, 1.6%, and 0.9%, respectively, while the parameter count and computational complexity are significantly reduced, effectively addressing practical detection requirements.
In comparison to the YOLOv3-tiny and YOLOv5s models, mAP@0.5 increased by 5.2% and 1.9%, respectively, while the model's parameter count decreased by 70.0% and 62.9% and its computational complexity was reduced by 38.8% and 50%, respectively. The FPS of the YOLOv3-tiny and YOLOv5s models are 4.8% and 3.8% higher than that of EBD-YOLO, respectively. Although these two models offer slightly better real-time performance, they contain a significantly larger number of parameters and FLOPS, which may impede practical deployment in real-world applications.
Compared to the lightweight YOLOv8n model, the improved model exhibits a slight increase in computational complexity and a 9.1% reduction in FPS, but achieves a 3.1% increase in mAP@0.5. In comparison to the lightweight YOLOv10n, the enhanced model shows a 16.4% decrease in FPS. While YOLOv10n achieves the highest FPS, the enhanced model keeps computational and parameter complexity low while maintaining high performance; notably, it demonstrates a 4.7% improvement in mAP@0.5, thereby achieving a better balance between detection accuracy and real-time performance.
To further illustrate the detection performance of the enhanced model, the models were evaluated using mAP@0.5, as depicted in Figure 7. In summary, the improved model proposed in this study effectively balances performance metrics when compared to other mainstream models. It demonstrates commendable results in both detection accuracy and computational efficiency, surpassing most conventional algorithms, and is therefore well suited to the complex environments encountered at coal gangue sorting sites.
3.5.2. Comparison of Different C3k2 Modules
To evaluate the effectiveness of the C3k2-EMA module, comparative experiments were conducted by replacing it in the benchmark model YOLOv11n with the C3k2-CA, C3k2-ECA, and C3k2-Faster modules under identical experimental conditions. The results of these experiments are presented in Table 3 below.
As illustrated in Table 3, the parameters and FLOPS of all module variants remain largely consistent. Among them, the C3k2-EMA module achieves the highest mAP@0.5, at 88.5%, although its FPS decreases by 2.4%. The mAP@0.5 of the C3k2-CA module decreases by 0.2%, while its FPS increases by 1% to 83.5 f·s⁻¹, at a slight cost in detection accuracy. The mAP@0.5 of the C3k2-ECA module improves by 0.3%, with FPS largely unchanged. The C3k2-Faster module achieves the highest FPS, with an increase of 2.0%, although at the cost of some detection accuracy. Overall, the C3k2-CA and C3k2-ECA modules strike a reasonable balance, and the C3k2-Faster module meets the requirements of lightweight design and real-time performance, but the C3k2-EMA module delivers the greatest improvement in detection accuracy while maintaining an acceptable balance among precision, computational load, and frame rate. Since the primary function of the C3k2 module in YOLOv11n is feature extraction, the C3k2-EMA module, with its superior detection accuracy and multi-scale feature extraction capability, was selected for subsequent experiments, as it best meets the specific demands of coal gangue sorting.
3.5.3. Supplementary Verification
To verify the superiority of the improved algorithm over other YOLO models, experiments were conducted using algorithms from the relevant research literature on the dataset presented in this paper, under identical experimental parameters; the findings are summarized in Table 4. As illustrated in Table 4, the EBD-YOLO algorithm exhibits a 4.1% increase in mAP@0.5 compared to HGTC-YOLOv8n; while parameters and FLOPS remain largely unchanged, FPS improves significantly, by 30.3%. Compared to the SG-YOLO algorithm, EBD-YOLO increases mAP@0.5 by 3.9%, reduces parameters by 29.3%, and improves FPS by 13.3%. Compared to the SS-YOLOv3-tiny algorithm, EBD-YOLO increases mAP@0.5 by 1.2%, substantially reduces both parameters and FLOPS, and improves FPS by 94%. These data show that using YOLOv3 as the benchmark model can yield high detection accuracy, but its real-time performance and detection efficiency remain suboptimal. Compared with the YOLOv5 and YOLOv8 models in the same series, adopting YOLOv11 as the benchmark enables the final enhanced algorithm, EBD-YOLO, to strike a balance between real-time processing and detection accuracy while addressing practical deployment challenges.
3.6. Visual Analysis and Experiments
To assess the detection performance of the EBD-YOLO model, both the YOLOv11n model and the EBD-YOLO model were applied to images from the test set. The visualization results for coal gangue are displayed in Figure 8.
The prediction boxes for coal are shown in light blue and labeled "coal", while those for gangue are shown in dark blue and labeled "gangue". As illustrated in Figure 8, both the YOLOv11n model and the EBD-YOLO model can effectively detect coal and gangue. However, the YOLOv11n model exhibits certain false detections and omissions, whereas the EBD-YOLO model substantially mitigates these issues. Furthermore, the detection accuracy of the EBD-YOLO model is markedly improved compared to the benchmark YOLOv11n.
The heatmap provides a clear and intuitive representation of the critical information emphasized by the model during coal gangue detection. This article adopts the Grad-CAM [27] heatmap visualization strategy for analysis. Grad-CAM captures gradient information from the final convolutional layer to derive per-channel weights; through a weighted summation, a heatmap is generated and overlaid onto the original image, illustrating the relative focus of attention during detection.
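A minimal classification-style Grad-CAM sketch is shown below to make the weighting step concrete; applying it to a detector such as EBD-YOLO requires choosing a target layer and a scalar score (e.g., a class confidence), both of which are assumptions here.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_layer, class_idx):
    """Heatmap for one input; assumes `model(x)` returns class logits of shape (1, C)."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    score = model(x)[0, class_idx]
    model.zero_grad()
    score.backward()                                    # gradients w.r.t. the target layer
    h1.remove(); h2.remove()
    w = grads[0].mean(dim=(2, 3), keepdim=True)         # per-channel weights from gradients
    cam = F.relu((w * feats[0]).sum(dim=1, keepdim=True))  # weighted sum over channels + ReLU
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize for overlay
```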
Figure 8d presents the heatmap results for the EBD-YOLO model. As shown in Figure 8d, the EBD-YOLO model focuses more strongly on the surface irregularities of coal and gangue during detection. This observation indicates that the enhanced model extracts relevant information more proficiently during feature extraction, confirming its robustness in complex scenarios.