Lithology Identification Based on Improved Faster R-CNN

Fu, Peng; Wang, Jiyang

doi:10.3390/min14090954


by Peng Fu and Jiyang Wang *

School of Artificial Intelligence, Shenyang University of Technology, Shenyang 110178, China

* Author to whom correspondence should be addressed.

Minerals 2024, 14(9), 954; https://doi.org/10.3390/min14090954

Submission received: 16 August 2024 / Revised: 11 September 2024 / Accepted: 16 September 2024 / Published: 21 September 2024

(This article belongs to the Special Issue Application of Deep Learning and Computer Vision in Petrographic Images Analysis)


Abstract:

In the mining industry, lithological identification is crucial for ensuring the safety of equipment and personnel, as well as for improving production efficiency. Traditional ore identification methods, such as visual inspection, physical testing, and chemical analysis, have many limitations in terms of their operational complexity and applicability. Modern ore identification technologies, especially those combined with deep learning methods, can effectively overcome these shortcomings and significantly enhance identification performance. However, mainstream deep learning object detection algorithms still suffer from low accuracy and poor identification performance in challenging mining conditions. To handle these problems, an improved Faster R-CNN model is proposed in this study. Specifically, we replace the backbone network ResNet with Res2Net-50 and incorporate an improved Feature Pyramid Network (FPN) to enhance feature fusion, thereby further improving the model’s feature extraction capability. Region of Interest (ROI) Align replaces the ROI pooling layer to solve the spatial misalignment issue, providing a higher detection accuracy in tasks involving small object detection and precise boundary detection. Additionally, the backbone feature extraction network integrates an efficient channel attention (ECA) module to optimize high-resolution semantic information maps. By adding simulated noise, the model’s robustness and anti-interference capabilities are enhanced. Soft-NMS is used instead of traditional NMS, preserving more potential targets through a confidence decay mechanism, thereby improving the detection accuracy and robustness. The experimental results show that the improved Faster R-CNN model maintains efficient and accurate ore identification capabilities even in complex mining environments, demonstrating its great potential in practical applications. The model achieves significant improvements in detection accuracy and efficiency, providing strong support for the intelligent and automated identification of ores.

Keywords:

lithology identification; computer vision; deep learning; Faster R-CNN; Res2Next; improved Feature Pyramid Network; ROI Align; Soft-NMS; efficient channel attention

1. Introduction

The mining industry holds a crucial position in global economic and social development. It not only serves as an economic cornerstone for many countries but also provides essential raw materials for modern industries. However, as human society progresses toward modernization, the construction of large-scale infrastructure projects increasingly relies on minerals, resulting in the mining industry struggling to meet the growing demand for resources. Lithological identification plays a vital role in the mining process. It aids in identifying hazardous rock layers and structures, thereby preventing safety incidents such as landslides and collapses, which ensures operational safety. Moreover, it assists geologists in accurately assessing the quantity and reserves of mineral deposits, determining the types and distributions of ores, and providing reliable foundational data for mining development [1]. Personnel and equipment are critical to the operation of mining enterprises; therefore, accurate identification of rock types is fundamental to determining the conditions of underground rock masses. Effective lithological identification can optimize mining plans, reduce unnecessary extraction, and lower production costs. Additionally, it plays a significant role in improving mineral processing techniques and environmental protection. Accurate lithological identification not only enhances ore utilization and economic benefits but also ensures the sustainable development of mining operations.

The traditional identification techniques and classification methods primarily encompass three approaches. The first approach is physical visual inspection, where geologists or technical engineers rely on their extensive experience to make preliminary judgments about the appearance characteristics of rocks or ores. While this method is simple and convenient, it heavily depends on individual expertise and experience, leading to potentially subjective results. The second approach involves physical testing, which identifies rocks or ores based on their physical properties, such as their density, magnetism, and hardness. For example, high-precision instruments such as X-ray powder diffraction and scanning electron microscopy are employed for detailed analysis. These methods can provide objective and accurate data but are complex to operate and expensive and require highly skilled personnel. The third approach is chemical analysis, which involves using chemical reagents to react with rocks or ores to determine their chemical composition, such as through acid–base titration methods [2]. This method can accurately measure the chemical composition of rocks or ores but typically requires destructive sampling and involves lengthy and cumbersome procedures. Despite the significant role these traditional methods have played in mining development, providing valuable data for geological exploration and mining, they exhibit several notable drawbacks. First, the operational processes of these methods are relatively complex, requiring specialized equipment and personnel, thus increasing both costs and time. Additionally, some methods impose strict requirements on the samples, resulting in limited applicability in large-scale or rapidly changing mining environments [3]. 
Consequently, although traditional lithology identification techniques were crucial during the early and mid-stages of mining development, their limitations have become increasingly apparent in modern mining production. To enhance efficiency and accuracy, the mining industry urgently needs more advanced and efficient identification techniques and methods to address increasingly complex geological conditions and growing resource demands.

Thanks to the rapid development of computer technology, artificial intelligence, big data, and digital image processing technology, modern identification methods have not only greatly improved the accuracy and efficiency of identification but have also simplified the operation process, reduced costs, and significantly expanded the scope of application. At present, identification equipment based on computer vision mainly relies on two core technologies: image processing technology based on machine learning and image processing technology based on deep learning. In image processing technology based on machine learning, the main algorithms include decision trees [4,5], Naive Bayes [6], k-nearest neighbors (KNN) [7], and support vector machines (SVMs) [8]. These algorithms have been extensively experimented with, tested, and applied in the mining field. For instance, Kaibo Zhou et al. [9] proposed a lithology identification method based on the gradient boosting decision tree (GBDT) algorithm, which combines the performance of the GBDT algorithm with the Synthetic Minority Over-Sampling Technique (SMOTE). Yunxin Xie et al. [10] proposed a semi-supervised lithology identification model. The framework establishes a high-quality baseline model by adjusting ensemble algorithms (such as random forests and gradient boosting decision trees) through Bayesian optimization. A self-training strategy is used to increase the number of labeled samples, and the highest confidence prediction labels are used as pseudo-labels to reduce bias accumulation. Quan Ren et al. [11] combined fuzzy theory, decision trees, and the K-means++ algorithm to propose a new hybrid lithology identification technique, which improves the lithology identification accuracy through fuzzification processing and a fuzzy decision tree model. In actual testing, the model achieved a prediction accuracy of 93.92%. Zerui Li et al. 
[12] used the Laplacian Support Vector Machine (LapSVM), introduced feature similarity and depth similarity, and combined K-means clustering to select the samples that needed labeling, thereby improving the classification performance. However, machine learning-based methods rely heavily on high-resolution images and additional feature extractors. These algorithms not only require a substantial number of high-resolution images for training but also depend on separate feature extractors to capture key features from the images. In the harsh conditions of mining environments (such as high dust levels and low light), collecting high-resolution images is exceedingly challenging, which limits the use of these methods in practical mining operations.

In recent years, as a significant branch of artificial intelligence, deep learning has achieved remarkable progress in the field of computer vision. Image classification, image segmentation, and object detection are regarded as the three classic tasks in computer vision and are also typical applications of deep learning in image processing. Image classification refers to categorizing images into different classes or labels. The classic image classification methods include LeNet-5 [13], VGG [14], GoogLeNet [15], ResNet [16], and DenseNet [17]. In 2022, Zhenhao Xu et al. [18] proposed an intelligent lithology identification method based on deep learning, using seven convolutional neural networks, including Xception and MobileNet v2, for microscopic image classification. They improved the model accuracy through the cross-entropy loss function and frequent iterative training and expanded the dataset using transfer learning and data augmentation. Le Gao et al. [19] employed the ResNet-50 neural network model for data training and prediction research. Yang Liu et al. [20] constructed four CNN models of different depths and structures based on VGG Net, Inception Net, and ResNet for multi-coal-type, multi-class image classification.

Although image classification tasks can accurately output image labels, they overlook the specific location and shape information of objects within the image. Image segmentation divides an image into several semantically meaningful regions, aiming to classify the pixels in the image so that each region corresponds to a specific object or part of the background. Depending on the task, image segmentation can mainly be divided into semantic segmentation and instance segmentation. Semantic segmentation assigns each pixel in the image into a predefined category, thereby identifying different objects and parts of the background in the image. Semantic segmentation focuses on pixel-level category classification without distinguishing between different instances of the same category. The classic semantic segmentation methods include FCNs [21], DeepLab [22], and U-Net [23]. In 2022, Xinlei Nie et al. [24] combined an FCN with ResNet50 and proposed a quartz sand particle size detection method based on the FCN-ResNet50 deep learning semantic segmentation network. Huizhong Liu et al. [25] used the DeepLab v3+ deep learning semantic segmentation model to identify images of mineral belts. Jiaxu Duan et al. [26] designed a lightweight U-Net deep learning network to automatically detect iron ore particles in images and obtain probability maps of the particle contours, achieving good segmentation performance. Instance segmentation also assigns each pixel in the image into a category, but unlike semantic segmentation, it needs to distinguish between different instances of the same category. Instance segmentation assigns a unique identifier to each object instance, meaning that even different object instances of the same category will be assigned different identifiers. Mask R-CNN is a typical instance segmentation algorithm. Luo Xiaoyan et al. [27] used Mask R-CNN for ore identification and localization, achieving a comprehensive accuracy of 97.6%. 
Although image segmentation can perform pixel-level identification and classification, its effectiveness is highly dependent on the acquisition environment, requiring adjustments and optimizations under specific conditions. High-resolution images and detailed segmentation processing generate large amounts of data, necessitating efficient data processing and storage capabilities.

Compared to image classification and image segmentation, object detection combines the functionalities of classification and localization, allowing the extraction of multiple types of information within a single task, which makes it highly practical in complex application scenarios. The objective of object detection is to identify and localize target objects in images or videos. Deep learning-based object detection algorithms can be categorized into single-stage and two-stage methods based on their network structure and detection principles [28]. In 2022, Wang Zhitao et al. [29] used the YOLOv4 object detection algorithm to train a classifier that included seven common types of ore, capable of classifying and predicting the positions of various ore samples in images. Hou Zhenlong et al. [30] developed a lithology recognition model by improving the Single Shot MultiBox Detector (SSD) object detection algorithm. Although single-stage algorithms have a speed advantage, two-stage algorithms, due to their two-step processing mechanism, are better suited to applications that prioritize accuracy and stability, especially in complex backgrounds and multi-object detection tasks. Therefore, this study adopts the Faster R-CNN model, a two-stage object detection algorithm, as the focus of the research. In lithology recognition tasks, many scholars have also conducted studies on the Faster R-CNN algorithm, as shown in Table 1.

Although the aforementioned studies demonstrate excellent performance, they are mostly based on staged images and often overlook the impact of noise on the models. Faster R-CNN generates candidate regions through the Region Proposal Network (RPN) and performs classification and localization within these regions. While Faster R-CNN performs well in object detection tasks, it faces significant challenges in lithology recognition within mining environments. Harsh working conditions, a wide variety of rock types, and unstable mining conditions can all affect the accuracy and efficiency of lithology recognition. To address these issues and ensure the model maintains high performance in complex environments, this study introduces targeted improvements into the Faster R-CNN model.

The structure of this paper is organized as follows: Section 2 provides a detailed overview of the Faster R-CNN model, including its components and their functions. Section 3 outlines the specific improvements made to the model. Section 4 describes the dataset construction process, experimental design, and model training procedure. Section 5 presents an analysis and discussion of the model’s prediction results. Finally, Section 6 summarizes the research findings and suggests directions for future research.

2. The Model Structure of Faster R-CNN

2.1. Overall Framework Structure

Faster R-CNN consists of a Region Proposal Network (RPN) and Fast R-CNN [34]. The input image is first passed through a backbone network, typically a convolutional neural network (CNN), for feature extraction. After multiple layers of convolution and pooling operations, a high-dimensional feature map is generated. The RPN scans the feature map using a sliding window and calculates the probability of an object being present (the objectness score), as well as the coordinates of the bounding box. Based on these scores, the top N anchors with the highest confidence are selected as candidate regions. Non-Maximum Suppression (NMS) is then applied to reduce redundancy and eliminate highly overlapping proposals. These candidate regions are subsequently resized to a fixed dimension using ROI pooling. ROI pooling divides the candidate regions into multiple subregions and performs max pooling within each subregion to ensure consistent output feature map dimensions. The features obtained through ROI pooling are then fed into the classification and regression heads of the detection network, which consists of fully connected layers. Each candidate region is classified, including background detection, and bounding box regression is performed to optimize the position and size of the bounding box. Through this workflow, Faster R-CNN effectively performs object detection, providing highly accurate detection results. The architecture of Faster R-CNN is shown in Figure 1.
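The NMS step in this workflow can be illustrated with a minimal sketch of greedy suppression. It is not taken from the paper: the box format (x1, y1, x2, y2), the function names `iou` and `nms`, and the threshold of 0.5 are assumptions chosen for illustration only.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Retain only boxes whose overlap with the kept box is small enough.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Hypothetical proposals: the first two overlap heavily, the third is separate.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box is suppressed because its overlap with the higher-scoring first box exceeds the threshold, while the third box survives.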

2.2. The Backbone Network

The backbone network is a critical component of Faster R-CNN, primarily responsible for extracting key features from the input image. These features are subsequently used to generate region proposals and perform object classification. Typically, the backbone network is composed of a deep convolutional neural network (CNN) that effectively captures both high-level semantic information and low-level details in the image. A high-quality backbone network can extract richer and more discriminative features, thereby significantly improving the accuracy of object detection. For example, ResNet introduces residual connections, allowing the network to be deeper and capture more complex and fine-grained features. This characteristic makes the backbone network instrumental in enhancing model performance, leading to more accurate and reliable object detection.

2.3. The Region Proposal Network (RPN)

The Region Proposal Network (RPN) is a key technology for object detection. It generates candidate object regions (also known as proposal boxes) on the input image, providing crucial inputs for subsequent object classification and bounding box regression. The working principle of the RPN is relatively complex, involving multiple steps such as feature extraction, candidate region generation, classification, and regression. First, the RPN generates a set of predefined boxes called anchors on the input image. These anchors cover various possible object shapes and sizes, including different aspect ratios and scales. The anchors can be generated at fixed positions on the image or using different sizes and aspect ratios. Next, the input image is passed through a pre-trained convolutional neural network (CNN) for feature extraction. Commonly used CNNs include VGG and ResNet. The purpose of feature extraction is to convert the input image into a series of feature maps that contain the semantic and structural information of the image. The anchors generated are mapped to corresponding positions on the feature maps so that each anchor corresponds to a position on the feature map. When performing subsequent operations on the feature maps, the features at each anchor position are typically processed using convolution operations to extract information related to that position. These features are used for classification and regression tasks. At each anchor position, the RPN processes the features through two parallel convolutional neural network branches: the classification branch uses convolutional layers to output the probability of each anchor containing an object. Typically, binary classification is performed to determine whether each anchor is a candidate region containing an object. The regression branch uses convolutional layers to output parameters that adjust the position of each anchor’s bounding box to fit the actual object’s location better. 
A regression model is usually employed to predict the position offsets of the bounding box. Based on the output of the classification branch, anchors with higher probabilities are selected as candidate object regions. Then, the positions of these candidate regions are adjusted according to the output of the regression branch to obtain the final proposal boxes. Additionally, non-maximum suppression (NMS) is applied to the proposal boxes to filter out overlapping ones and retain the most representative proposal boxes. The principle of RPN anchor box generation is shown in Figure 2.
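As a rough illustration of anchor generation, the sketch below places one set of anchors at every feature-map position. The stride of 16, the three scales, and the three aspect ratios are common defaults assumed here for illustration; the paper does not state its own settings, and `make_anchors` is a hypothetical helper name.

```python
import math


def make_anchors(feat_h, feat_w, stride, scales, ratios):
    """Return anchors as (x1, y1, x2, y2) in image coordinates.

    One anchor per (scale, ratio) pair is centred on every feature-map cell.
    ratio is height / width; each anchor keeps an area of about scale ** 2.
    """
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Centre of this feature-map cell, mapped back to the image.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


# Assumed settings: a 2 x 3 feature map, stride 16, 3 scales x 3 ratios.
anchors = make_anchors(2, 3, 16, (128, 256, 512), (0.5, 1.0, 2.0))
```

With these assumed settings each position carries nine anchors, so the small 2 × 3 feature map already yields 54 candidate boxes for the classification and regression branches to score and refine.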

2.4. ROI Pooling

ROI pooling (region of interest pooling) is a commonly used operation in object detection. ROI pooling maps the coordinates of candidate regions from the original image to the convolutional feature map. A candidate region of size h × w is divided into a fixed grid of sections (such as 7 × 7 or 14 × 14). Within each section, max pooling is performed, which reduces the feature values in the section to a single value. After the pooling operation, the feature maps of all the candidate regions therefore have the same fixed size (such as 7 × 7 or 14 × 14), regardless of the original region size, making subsequent processing with fully connected layers simpler and more efficient. ROI pooling was first introduced in Fast R-CNN and has become a fundamental component of detection frameworks such as Faster R-CNN. The principle of ROI pooling is shown in Figure 3.
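The pooling step can be sketched on a single-channel feature map as follows. This is a simplified illustration, not the paper's implementation: the region is assumed to be already expressed in integer feature-map cells (inclusive corners), the output grid is 2 × 2 rather than 7 × 7, and `roi_max_pool` and `ceil_div` are hypothetical helper names.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature, roi, out_size=2):
    """Max-pool one region of a 2D feature map to an out_size x out_size grid.

    feature: list of rows (H x W); roi: (x1, y1, x2, y2), inclusive cell indices.
    """
    x1, y1, x2, y2 = roi
    roi_h = y2 - y1 + 1
    roi_w = x2 - x1 + 1
    pooled = []
    for i in range(out_size):
        # Section edges are rounded to whole cells (floor for start, ceil for end).
        r_start = y1 + (i * roi_h) // out_size
        r_end = y1 + ceil_div((i + 1) * roi_h, out_size)
        row = []
        for j in range(out_size):
            c_start = x1 + (j * roi_w) // out_size
            c_end = x1 + ceil_div((j + 1) * roi_w, out_size)
            cells = [feature[r][c]
                     for r in range(r_start, r_end)
                     for c in range(c_start, c_end)]
            row.append(max(cells))
        pooled.append(row)
    return pooled


# 4 x 4 feature map holding the values 0..15 in row-major order.
feature = [[4 * r + c for c in range(4)] for r in range(4)]
pooled_full = roi_max_pool(feature, (0, 0, 3, 3))   # 4 x 4 region
pooled_small = roi_max_pool(feature, (0, 0, 2, 2))  # 3 x 3 region
```

Both regions produce a 2 × 2 output even though their sizes differ. For the 3 × 3 region the sections cannot be split evenly, so the rounded edges overlap, which is the kind of quantization effect that ROI Align is designed to avoid.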

2.5. Classification and Regression Networks

The primary tasks of classification and regression networks are target classification (i.e., identifying the target category) and bounding box regression (i.e., accurately locating the target within each candidate region). After the ROI pooling operation, the fixed-size feature maps are converted into one-dimensional vectors. These vectors then pass through multiple fully connected layers for further feature extraction and transformation. The classification sub-network typically consists of one or more fully connected layers, culminating in a softmax layer that predicts the target category for each candidate region. The output of the classification network is a probability distribution, indicating the likelihood that each candidate region belongs to a specific category. This probability distribution effectively aids the model in distinguishing between different target categories, thereby enhancing the classification accuracy. Similarly, the regression sub-network also comprises one or more fully connected layers and concludes with a linear layer that predicts the regression parameters of the bounding box. The output regression parameters include four values (dx, dy, dw, dh), representing the offset of the bounding box’s center point and adjustments to its width and height. These parameters enable the model to precisely adjust the position and size of the bounding box, ensuring accurate target localization. By seamlessly integrating these two sub-networks, Faster R-CNN can achieve precise target classification and localization within images. The classification network ensures that the target category of each candidate region is correctly identified, while the regression network further refines the bounding box coordinates, significantly enhancing the accuracy and reliability of object detection.
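How the four regression outputs adjust a box can be sketched as below. The sketch assumes the parameterization commonly used with Faster R-CNN, in which dx and dy shift the centre in units of the box width and height and dw and dh scale the size exponentially; the paper does not spell out its exact form, and `decode_box` is a hypothetical name.

```python
import math


def decode_box(proposal, deltas):
    """Apply (dx, dy, dw, dh) to a proposal given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = proposal
    dx, dy, dw, dh = deltas
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    # Centre offsets are relative to the proposal size.
    new_cx = cx + dx * w
    new_cy = cy + dy * h
    # Width and height are rescaled in log space.
    new_w = w * math.exp(dw)
    new_h = h * math.exp(dh)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


unchanged = decode_box((0.0, 0.0, 10.0, 20.0), (0.0, 0.0, 0.0, 0.0))
shifted = decode_box((0.0, 0.0, 10.0, 20.0), (0.1, 0.0, 0.0, 0.0))
```

With all deltas at zero the proposal is returned unchanged, and dx = 0.1 moves a 10-pixel-wide box one pixel to the right without altering its size.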

2.6. Loss Function

The loss function in Faster R-CNN consists of four loss components, corresponding to the classification and regression tasks of the Region Proposal Network (RPN) and the detection network, respectively. The total loss function of Faster R-CNN is shown in Equation (1):

$$L = L_{cls}^{RPN} + \lambda_{1} L_{reg}^{RPN} + L_{cls} + \lambda_{2} L_{reg} \tag{1}$$

where $L$ represents the total loss of the model, $L_{cls}^{RPN}$ represents the classification loss of the RPN, and $L_{reg}^{RPN}$ represents the regression loss of the RPN. $L_{cls}$ represents the classification loss of the detection network, and $L_{reg}$ represents the regression loss of the detection network. $\lambda_{1}$ and $\lambda_{2}$ are hyperparameters that balance the classification and regression losses, typically set to 1.

$$L_{cls}^{RPN} = -\frac{1}{N_{cls}} \sum_{i} \left[ p_{i}^{*} \log(p_{i}) + (1 - p_{i}^{*}) \log(1 - p_{i}) \right] \tag{2}$$

In this equation, $p_{i}$ is the predicted probability that the i-th anchor is a foreground object, $p_{i}^{*}$ is the true label for the i-th anchor (1 for positive, 0 for negative), and $N_{cls}$ is the number of anchors.

$$L_{reg}^{RPN} = \frac{1}{N_{reg}} \sum_{i=1}^{N_{reg}} p_{i}^{*} \sum_{j \in \{x, y, w, h\}} R(t_{ij} - t_{ij}^{*}) \tag{3}$$

where $t_{ij}$ represents the predicted regression parameters for the i-th anchor, $t_{ij}^{*}$ represents the ground truth regression parameters for the i-th anchor, and $N_{reg}$ is the number of positive anchors; $R$ denotes the smooth L1 loss function.

$$L_{cls} = -\frac{1}{N_{cls}} \sum_{i=1}^{N_{cls}} \sum_{c=1}^{C} p_{ic}^{*} \log(p_{ic}) \tag{4}$$

In Equation (4), $p_{ic}$ represents the predicted probability of the i-th anchor belonging to class c; $p_{ic}^{*}$ is the ground truth label of the i-th anchor for class c (1 if the anchor belongs to class c and 0 otherwise); $C$ is the number of classes; and $N_{cls}$ is the number of anchors.

$$L_{reg} = \frac{1}{N_{reg}} \sum_{i=1}^{N_{reg}} p_{i}^{*} \sum_{j \in \{x, y, w, h\}} R(t_{ij} - t_{ij}^{*}) \tag{5}$$

In Equation (5), $t_{ij}$ is the predicted regression parameter for the i-th anchor; $t_{ij}^{*}$ is the ground truth regression parameter for the i-th anchor; $N_{reg}$ is the number of anchors with targets; and $R$ denotes the smooth L1 loss function.
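The RPN terms of the loss can be made concrete with a small numerical sketch. It follows Equations (2) and (3) under two assumptions that the text does not state: the smooth L1 function uses the common transition point of 1 (the hypothetical parameter `beta`), and a tiny `eps` is added inside the logarithms for numerical safety. All function names are illustrative.

```python
import math


def smooth_l1(x, beta=1.0):
    """Smooth L1: quadratic near zero, linear for large errors (beta assumed 1)."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def rpn_cls_loss(probs, labels, eps=1e-12):
    """Binary cross-entropy averaged over all anchors, as in Equation (2)."""
    total = 0.0
    for p, y in zip(probs, labels):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(probs)


def rpn_reg_loss(pred, target, labels):
    """Smooth L1 over (x, y, w, h), counted for positive anchors only, Equation (3)."""
    n_pos = sum(labels)
    if n_pos == 0:
        return 0.0
    total = 0.0
    for t, t_star, y in zip(pred, target, labels):
        if y == 1:
            total += sum(smooth_l1(a - b) for a, b in zip(t, t_star))
    return total / n_pos


# Two hypothetical anchors: one positive, one negative.
labels = [1, 0]
cls = rpn_cls_loss([0.5, 0.5], labels)
reg = rpn_reg_loss([(0.5, 0.0, 0.0, 0.0), (9.0, 9.0, 9.0, 9.0)],
                   [(0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)],
                   labels)
rpn_total = cls + 1.0 * reg  # lambda_1 = 1, as in Equation (1)
```

The negative anchor contributes to the classification term but not to the regression term, however large its coordinate error, because its label $p_{i}^{*}$ is 0.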

3. The Improved Faster R-CNN Model

In complex and harsh environments such as mines, object detection faces numerous challenges. The main issues are as follows:

In underground mines or other environments with poor lighting conditions, insufficient light can significantly reduce the contrast of the target objects in images. This makes it difficult for traditional image processing algorithms to distinguish between the target and the background. Especially for edge detection algorithms, the boundaries of the target objects become blurred. In such cases, object detection algorithms need to be capable of working effectively under low-light conditions.
Dust and other particulate matter in the mining environment further degrade image quality. Dust not only obscures target objects, causing information loss in local areas, but also accumulates on camera lenses, resulting in image distortion and blurring. This severely affects the segmentation accuracy of object detection models, making it difficult for the model to accurately locate and identify target objects.
In mining environments, vibrations from equipment or transportation by conveyor belts can cause dynamic blurring of targets in images. Blurring caused by vibration and movement makes the outlines of the target objects unclear, thereby affecting the accuracy of object detection models. This dynamic blurring not only impacts the quality of static images but also degrades the image quality of continuous frames in video streams.

To address these challenges, this paper improves the Faster R-CNN model, making it more suitable for detection and recognition tasks in complex mining environments. The improved network model is shown in Figure 4.

3.1. Backbone Network Improvements

To achieve a better feature extraction performance and balance the network’s depth and width, this paper selects Res2Net-50 combined with an improved FPN as the backbone network architecture. Res2Net is an enhanced convolutional neural network architecture designed to improve the model’s feature representation capabilities and performance. By introducing a multi-scale feature expression approach, Res2Net significantly enhances the network’s feature extraction ability. Unlike traditional convolutional layer structures, Res2Net incorporates multiple scales of feature extraction within a single convolutional layer. It achieves this by dividing the feature channels into multiple groups and introducing different scales among these groups for feature extraction. Structural diagrams of the ResNet [35] and Res2Net [36] modules are shown in Figure 5.

Combined with Figure 5, the standard ResNet bottleneck block expression is as follows:

$$F(x) = W_{3}\, \sigma(W_{2}\, \sigma(W_{1} x)) \dots \tag{6}$$

$$y = F(x) + x \tag{7}$$

where $W_{1}$, $W_{2}$, $W_{3}$, … represent different convolution kernels, and $\sigma$ represents the non-linear activation function (ReLU). The expression for the Res2Net bottleneck block is as follows:

$$x = [x_{1}, x_{2}, \dots, x_{s}] \tag{8}$$

$$F(x) = [f_{1}(x_{1}), f_{2}([x_{1}, x_{2}]), \dots, f_{s}([x_{1}, x_{2}, \dots, x_{s}])] \tag{9}$$

$$y = F(x) + x \tag{10}$$

where $x_{i}$ represents the i-th sub-feature map, and $f_{i}$ represents the convolution operation associated with the i-th sub-feature map. $F(x)$ represents multi-scale convolution operations. Specifically, the input x is divided into multiple sub-feature maps $x_{1}, x_{2}, \dots, x_{s}$, each of which undergoes convolution operations and is subsequently merged. By introducing multi-scale feature extraction, Res2Net increases the diversity and density of the features, enhancing the model’s ability to represent objects of various sizes. Res2Net retains the modular design of ResNet, allowing for easy integration into existing deep learning architectures while providing enhanced feature representation capabilities. In contrast, ResNet performs feature extraction at a single scale. When targets span different scales, Res2Net’s multi-scale feature extraction can capture these targets more effectively, improving both the classification and detection performance. Because the convolution kernels of the different groups are connected to one another, Res2Net can extract and merge multi-scale features within a single layer, thereby enhancing the information representation capability.

The FPN (Feature Pyramid Network) is a network structure designed to improve the detection of small objects and the overall detection accuracy through multi-scale feature fusion [37]. The FPN consists of two main components: a bottom-up pathway and a top-down pathway. The bottom-up pathway is the forward pass of the backbone, which extracts feature maps at successively lower resolutions (C2, C3, C4, C5). The top-down pathway upsamples the semantically strong, low-resolution maps and merges them with the bottom-up maps of matching size through lateral connections, yielding the fused maps (P2, P3, P4, P5). By incorporating a bottom-up pathway, the FPN introduces more detailed information, combining the abstract semantics of high-level features with the detailed features of low-level features, forming an efficient detection framework. The structure of the improved FPN is shown in Figure 6.
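For orientation, one fusion step of a standard FPN top-down pathway can be sketched as follows. This shows only the generic operation (nearest-neighbour upsampling followed by element-wise addition with a lateral map), not the improved FPN of this paper. Single-channel maps and the names `upsample2x` and `top_down_merge` are assumptions, and the 1 × 1 lateral convolution is omitted for brevity.

```python
def upsample2x(fmap):
    """Nearest-neighbour upsampling of a 2D map by a factor of two."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def top_down_merge(coarse, lateral):
    """Fuse a coarse, high-level map into the next finer level."""
    up = upsample2x(coarse)
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]


# Hypothetical maps: a 1 x 2 coarse level and a 2 x 4 finer lateral level.
coarse = [[1.0, 2.0]]
lateral = [[0.5, 0.5, 0.5, 0.5],
           [0.0, 0.0, 0.0, 0.0]]
fused = top_down_merge(coarse, lateral)
```

Each fused cell carries the high-level value of its coarse parent plus the local detail of the finer map, which is the sense in which the pyramid combines semantic and detailed information.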

By integrating Res2Net’s multi-scale feature representation with an improved FPN design, the diversity and richness of features are significantly enhanced. The improved FPN is capable of comprehensively capturing and utilizing both detailed and semantic information within the images. This enhancement makes the improved FPN more suitable for target detection in mining environments, particularly under challenging conditions such as low light, high dust levels, and motion blur.

3.2. ROI Align

ROI Align (region of interest align) is a feature alignment technique used in object detection tasks to address the accuracy loss inherent in ROI pooling [38]. In object detection, it is often necessary to extract features of the regions of interest (ROIs) from the feature map for subsequent classification and bounding box regression. Traditional ROI pooling rounds the boundaries of the ROI and of its fixed-size grid cells to integer positions on the feature map and then performs pooling on the features within each cell. This rounding can lead to information loss and quantization errors, thereby reducing the detection accuracy. ROI Align improves the detection accuracy by achieving more precise feature alignment through bilinear interpolation of the positions within the ROI. Given the input feature map and the coordinates of the ROIs, each ROI is divided into equally spaced subregions without rounding. Typically, the granularity of this division is finer than that of ROI pooling. For each sampling point within a subregion, bilinear interpolation is used to calculate the feature value from the neighboring positions in the feature map. This ensures that the coordinates within the ROI correspond exactly to coordinates in the feature map, avoiding the quantization errors seen in ROI pooling. Within each subregion, a pooling operation (e.g., max pooling) is performed on the feature values obtained through bilinear interpolation, and the pooled value is taken as the output feature for that subregion. The output features of all subregions are then assembled to form the final output feature of ROI Align. These features preserve the spatial structure and detailed information within the ROI, preventing the information loss associated with ROI pooling. The principle of ROI Align is shown in Figure 7.
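A compact sketch of this procedure on a single-channel feature map is given below. Several details are assumptions made for illustration rather than settings from the paper: feature values are taken to sit at integer coordinates, each subregion uses a 2 × 2 grid of sampling points, the output grid is 2 × 2, and max pooling is the default reduction (averaging is another common choice). The names `bilinear` and `roi_align` are hypothetical.

```python
import math


def bilinear(feature, y, x):
    """Bilinearly interpolate a 2D map at a continuous position (y, x)."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the valid range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = feature[y0][x0] * (1 - wx) + feature[y0][x1] * wx
    bottom = feature[y1][x0] * (1 - wx) + feature[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def roi_align(feature, roi, out_size=2, samples=2, pool=max):
    """Pool an ROI with continuous coordinates (x1, y1, x2, y2), no rounding."""
    x1, y1, x2, y2 = roi
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            vals = []
            for sy in range(samples):
                for sx in range(samples):
                    # Regularly spaced sampling points inside the subregion.
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear(feature, y, x))
            row.append(pool(vals))
        out.append(row)
    return out


# Feature map whose value equals its column index, so interpolation is easy to check.
feature = [[float(c) for c in range(4)] for _ in range(4)]
aligned = roi_align(feature, (0.5, 0.5, 2.5, 2.5))
```

Because the ROI corners stay at 0.5 and 2.5 instead of being rounded, the sampled values fall between the grid values of the feature map, which is exactly what quantized ROI pooling cannot represent.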

3.3. The Efficient Channel Attention Module

Efficient channel attention (ECA) is a lightweight attention mechanism designed to enhance the model performance by simplifying the calculation process while adding almost no extra parameters or computational overhead. The ECA mechanism primarily captures the relative importance of each channel through local cross-channel interaction [39].

The core idea of ECA is to use a one-dimensional convolution operation along the channel dimension to achieve local interaction, thereby generating attention weights for each channel. First, global average pooling is performed on the input feature map to obtain the global information for each channel. The calculation expression is as follows:

z_{c} = \frac{1}{H \times W} \sum_{i = 1}^{H} \sum_{j = 1}^{W} x_{i j c}

(11)

where x_{i j c} represents the value of the input feature map at the (i, j) position in channel c, and H and W denote the height and width of the feature map, respectively. z_{c} represents the global average-pooled value of channel c.

Next, a one-dimensional convolution operation is applied to the globally average-pooled feature z to capture the local dependencies among the channels. The calculation expression is as follows:

s = σ(\mathrm{Conv1D}(z, k))

(12)

where Conv1D(z, k) represents a one-dimensional convolution operation with kernel size k, σ denotes the Sigmoid activation function, and s represents the attention weight of each channel. Finally, the attention weight of each channel is applied to the corresponding channel of the input feature map to obtain the recalibrated feature map. The calculation expression is as follows:

{\bar{x}}_{i j c} = s_{c} \cdot x_{i j c}

(13)

where {\bar{x}}_{i j c} represents the recalibrated feature value after applying the channel attention weight.

The ECA mechanism significantly reduces the computational costs by introducing a 1D convolution operation, especially when dealing with a large number of channels. Unlike traditional global pooling operations, ECA achieves similar effects with less computational effort. ECA calculates the inter-channel relationships using local information instead of processing the entire channel globally. This locality helps retain more feature information, making it particularly suitable for handling mining image data. By effectively modeling the dependencies between channels, ECA precisely adjusts the importance of each channel, thereby enhancing the feature representation capability. The ECA mechanism adaptively adjusts the channel weights to highlight important features while suppressing irrelevant ones. In low-light and dusty environments, ECA can enhance the extraction of key information, enabling the model to capture useful features even in complex conditions. In situations with dynamic blur and material occlusion, ECA reduces the interference of irrelevant information with the detection results by emphasizing important features. The structure of the ECA mechanism is illustrated in Figure 8.
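The three steps of Equations (11)–(13) can be traced with a small, dependency-free sketch. Several points are assumptions made here for illustration only: the function name `eca`, the channel-first indexing `x[c][i][j]` (the equations index the same values as x_{ijc}), zero padding at the ends of the channel axis, an odd kernel size, and uniform kernel weights 1/k standing in for the weights that a trained module would learn.

```python
import math

def eca(x, k=3):
    """Apply ECA-style channel attention to a feature map indexed as x[c][i][j]."""
    channels = len(x)
    h, w = len(x[0]), len(x[0][0])
    # Eq. (11): global average pooling gives one scalar z_c per channel.
    z = [sum(sum(row) for row in x[c]) / (h * w) for c in range(channels)]
    # Eq. (12): 1D convolution across neighbouring channels, then a sigmoid.
    # Uniform weights are a placeholder assumption; a real module learns them.
    kernel = [1.0 / k] * k
    pad = k // 2
    s = []
    for c in range(channels):
        acc = 0.0
        for t in range(k):
            idx = c + t - pad
            if 0 <= idx < channels:  # zero padding outside the channel range
                acc += kernel[t] * z[idx]
        s.append(1.0 / (1.0 + math.exp(-acc)))
    # Eq. (13): every value of channel c is rescaled by its weight s_c.
    y = [[[s[c] * v for v in row] for row in x[c]] for c in range(channels)]
    return y, s

# Three 2 x 2 channels with constant values 1, 2, and 3.
features = [[[float(c + 1)] * 2 for _ in range(2)] for c in range(3)]
recalibrated, weights = eca(features, k=3)
```

Only k weights are involved regardless of the number of channels, which is why the attention step stays cheap even for wide feature maps.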

3.4. Soft-NMS

Soft-NMS (Soft Non-Maximum Suppression) is a method used in object detection to suppress multiple detections of the same object [40]. Traditional NMS works by directly removing bounding boxes whose overlap with a higher-scoring box exceeds a certain threshold, while Soft-NMS gradually reduces the confidence scores of highly overlapping bounding boxes, so that these boxes are removed in subsequent processing steps only if their scores fall too low. Compared to NMS, Soft-NMS demonstrates greater robustness in handling complex scenarios. The core of this algorithm lies in using a confidence decay function to adjust the confidence scores of overlapping bounding boxes. The specific calculation process is as follows: first, the detection boxes are sorted according to their confidence scores. The bounding box with the highest score, M, is selected as the current reference box. For all the remaining bounding boxes, their confidence scores are updated based on the Intersection over Union (IoU) with reference box M. The bounding box list is then updated by removing those with scores below a certain threshold. Soft-NMS typically employs two types of decay functions: linear decay and Gaussian decay. These two decay functions are described as follows.

s_{i} = s_{i} \times (1 - \mathrm{IoU}(M, D_{i}))

(14)

where s_{i} represents the confidence score of the i-th bounding box, M is the current highest scoring bounding box, and D_{i} is the i-th bounding box. \mathrm{IoU}(M, D_{i}) denotes the Intersection over Union between bounding boxes M and D_{i}.

s_{i} = s_{i} \times \exp\left(- \frac{\mathrm{IoU}(M, D_{i})}{σ}\right)

(15)

where exp represents the exponential function, and σ is a hyperparameter controlling the decay rate. Although NMS (Non-Maximum Suppression) is logically simple, is easy to implement, and integrates well with other systems, it may remove some true positive detection boxes in scenarios with dense or highly overlapping objects, as it strictly eliminates overlapping boxes. In contrast, Soft-NMS preserves more potential targets through a gentle confidence decay mechanism.
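The selection-and-decay loop described in this subsection can be sketched as follows. The sketch applies Equations (14) and (15) exactly as they are written here; note that the originally published Soft-NMS formulation squares the IoU inside the Gaussian penalty and applies the linear penalty only above an overlap threshold. The function names, the box format (x1, y1, x2, y2), and the default values `sigma=0.5` and `score_thresh=0.001` are illustrative assumptions, not settings reported in this paper.

```python
import math

def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, method="gaussian", sigma=0.5, score_thresh=0.001):
    """Return the (box, score) pairs kept by Soft-NMS, in selection order."""
    candidates = [(list(b), float(s)) for b, s in zip(boxes, scores)]
    kept = []
    while candidates:
        # Select the highest-scoring remaining box as the reference M.
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        m_box, m_score = candidates.pop(best)
        kept.append((m_box, m_score))
        rescored = []
        for box, score in candidates:
            overlap = iou(m_box, box)
            if method == "linear":
                score *= 1.0 - overlap               # Eq. (14)
            else:
                score *= math.exp(-overlap / sigma)  # Eq. (15) as written above
            # Boxes whose decayed score falls below the threshold are dropped.
            if score >= score_thresh:
                rescored.append((box, score))
        candidates = rescored
    return kept

# Two heavily overlapping boxes and one isolated box (hypothetical values).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = soft_nms(boxes, scores)
```

In this example, the second box overlaps the first with an IoU of about 0.68. Hard NMS with a typical threshold of 0.5 would discard it, whereas Soft-NMS keeps it with a reduced score and ranks it below the isolated box.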

4. Materials and Methods

4.1. Dataset

4.1.1. Dataset Collection

Open-source rock datasets are primarily designed for image classification and often contain images that are small in size, which makes them unsuitable for the algorithms used in this study. Because the algorithm developed in this paper targets object detection and segmentation tasks, for which the existing open-source ore datasets are largely inadequate, a new dataset was necessary. To overcome this limitation, we combined web scraping and field photography to collect over 600 high-quality images of specific rock types. These images cover six types of rocks: marble (100 images), granite (100 images), sandstone (100 images), limestone (100 images), schist (100 images), and basalt (100 images). The dataset collected was then divided into training, validation, and testing sets at a 7:2:1 ratio. Additionally, we generated several multi-class mixed rock images, which were directly incorporated into the testing set for evaluation. A portion of the original data is shown in Figure 9 to provide a clearer understanding of the dataset’s composition and diversity.
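As an illustration of the 7:2:1 division, the following sketch shuffles a list of image names and cuts it into three parts. The function name `split_dataset`, the fixed random seed, and the file names are hypothetical. The paper does not state whether the split was stratified by rock type, so a plain random split is assumed here.

```python
import random

def split_dataset(items, ratios=(7, 2, 1), seed=0):
    """Shuffle items reproducibly and split them into train/val/test by ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test

# Hypothetical file names standing in for 6 classes x 100 single-class images.
image_names = [f"rock_{i:03d}.jpg" for i in range(600)]
train_set, val_set, test_set = split_dataset(image_names)
```

For 600 single-class images, this produces 420 training, 120 validation, and 60 test images.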

4.1.2. Image Annotation

In the task of object detection in image processing, it is essential to provide bounding boxes for the objects, thus requiring annotation of the dataset images. This study employs the LabelMe annotation tool (software version: 5.5.0, manufacturer or laboratory: MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), Cambridge, MA, USA) for image annotation. LabelMe is an open-source image annotation tool suitable for creating annotated datasets for object detection tasks. It offers an intuitive user interface that allows users to manually annotate images accurately and supports various common annotation formats. In this research, the LabelMe tool was used to annotate the images, and the annotated dataset was then converted into the Pascal VOC dataset format.
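A conversion of this kind can be sketched with the Python standard library. The sketch assumes the usual LabelMe JSON fields (`imagePath`, `imageWidth`, `imageHeight`, and `shapes` entries with `label` and `points`) and the common Pascal VOC XML layout. The function name `labelme_to_voc`, the fixed image depth of 3, and the example record are illustrative and do not come from the authors' pipeline.

```python
import xml.etree.ElementTree as ET

def labelme_to_voc(record):
    """Convert one LabelMe annotation (already parsed from JSON) to a VOC XML string."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = record["imagePath"]
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(record["imageWidth"])
    ET.SubElement(size, "height").text = str(record["imageHeight"])
    ET.SubElement(size, "depth").text = "3"  # assumption: three-channel RGB images
    for shape in record["shapes"]:
        # Rectangles and polygons alike are reduced to their enclosing box.
        xs = [p[0] for p in shape["points"]]
        ys = [p[1] for p in shape["points"]]
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = shape["label"]
        box = ET.SubElement(obj, "bndbox")
        ET.SubElement(box, "xmin").text = str(int(round(min(xs))))
        ET.SubElement(box, "ymin").text = str(int(round(min(ys))))
        ET.SubElement(box, "xmax").text = str(int(round(max(xs))))
        ET.SubElement(box, "ymax").text = str(int(round(max(ys))))
    return ET.tostring(root, encoding="unicode")

# Hypothetical annotation with a single rectangle.
example = {
    "imagePath": "marble_001.jpg",
    "imageWidth": 640,
    "imageHeight": 480,
    "shapes": [{"label": "marble", "shape_type": "rectangle",
                "points": [[10.4, 20.6], [110.2, 220.0]]}],
}
voc_xml = labelme_to_voc(example)
```

In practice, each LabelMe file would first be loaded with `json.load`, and the resulting string would be written to an `.xml` file next to the image.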

4.1.3. Simulated Mine Scene Noise

The images in the constructed rock dataset were staged and lacked the interference of other noise factors. In contrast, photos taken in real mining environments are often affected by dust, insufficient lighting, and other factors, resulting in noisy images. To make the experimental data more representative of real-world conditions, this study simulates the effects of real mining environments by adding the following types of noise:

Dust noise simulation: Randomly add small spots or particles to the images to mimic dust particles. This type of noise can be achieved by overlaying the images with randomly distributed small dots or specks.
Brightness and contrast adjustment: Adjust the brightness and contrast of the images, lowering the overall brightness and increasing shadow areas to simulate the insufficient lighting in mining environments. Additionally, random dark spots can be added to simulate uneven lighting conditions.
Lens smudge noise: Simulate lens smudges by adding irregularly shaped semi-transparent patches or blurred areas to the images. These smudges can appear as randomly distributed blurred spots or larger semi-transparent coverage areas, mimicking dust, dirt, or water droplets on the camera lens.
Dynamic blur noise: Simulate blur effects caused by motion or changes in the image or signal processing. In mining environments, this blur effect can result from conveyor belts, vibrations from mining equipment, or other dynamic changes.

Some simulated noise images of the mining environment are shown in Figure 10.
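The four noise types listed above can be approximated with very simple image operations. The sketch below works on a grayscale image stored as a list of rows with integer values in 0–255. All function names and parameter values (speck count, brightness and contrast factors, smudge opacity, blur length) are assumptions chosen for illustration and are not the settings used to produce Figure 10.

```python
import random

def _clamp(value):
    """Round to the nearest integer and clip to the 8-bit range 0..255."""
    return max(0, min(255, int(round(value))))

def add_dust(img, n_specks=50, seed=0):
    """Overlay randomly placed bright specks to mimic dust particles."""
    rng = random.Random(seed)
    out = [row[:] for row in img]
    h, w = len(out), len(out[0])
    for _ in range(n_specks):
        out[rng.randrange(h)][rng.randrange(w)] = rng.randint(180, 255)
    return out

def darken(img, brightness=0.5, contrast=0.8):
    """Compress contrast around mid-grey, then scale the overall brightness down."""
    return [[_clamp(((p - 128) * contrast + 128) * brightness) for p in row]
            for row in img]

def add_smudge(img, center, radius, alpha=0.5, grey=128):
    """Blend a semi-transparent grey disc into the image to mimic a lens smudge."""
    cy, cx = center
    out = [row[:] for row in img]
    for y, row in enumerate(out):
        for x, p in enumerate(row):
            if (y - cy) ** 2 + (x - cx) ** 2 <= radius ** 2:
                row[x] = _clamp((1 - alpha) * p + alpha * grey)
    return out

def motion_blur(img, length=5):
    """Average each pixel with its horizontal neighbours (horizontal motion blur)."""
    half = length // 2
    out = []
    for row in img:
        w = len(row)
        out.append([
            _clamp(sum(row[max(0, x - half):min(w, x + half + 1)])
                   / len(row[max(0, x - half):min(w, x + half + 1)]))
            for x in range(w)
        ])
    return out

# A uniform 8 x 8 test image (hypothetical) run through each transformation.
clean = [[200] * 8 for _ in range(8)]
dusty = add_dust(clean, n_specks=5)
dark = darken(clean)
smudged = add_smudge(clean, center=(4, 4), radius=2)
blurred = motion_blur([[0, 0, 255, 0, 0]], length=3)
```

Each function returns a new image and leaves its input unchanged, so the same clean test image can be reused to build one noisy variant per noise type.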

Subsequently, noise was added to the test dataset to generate a second test dataset for model prediction. This noise addition ensures that the experimental data more accurately reflect the shooting conditions in mining environments, making the evaluation data more representative and challenging. This process aims to verify the model’s robustness and anti-interference capabilities.

4.1.4. Data Augmentation

After adding simulated noise, the total dataset reached 840 images (620 original images + 240 noise-added images, with 60 images each for simulated dust, low brightness, lens smudges, and dynamic blur). However, this is still far from sufficient for training the Faster R-CNN model, including the improved version, because the network’s large number of parameters requires a significant amount of labeled input data. Annotating ore images is time-consuming and labor-intensive, so the annotated data are relatively limited. To make effective use of the limited annotated ore images for model training, this study employs data augmentation techniques, including fixed-angle rotation, translation, and scaling. Through this approach, the total number of images in the original dataset was expanded to 5280. This method increases the dataset’s diversity and enhances the model’s generalization capability. The overall data processing workflow is illustrated in Figure 11.
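For object detection, every geometric augmentation must be applied to the bounding boxes as well as to the pixels. The sketch below shows this for a 90° clockwise rotation, a translation, and a uniform scaling. The rotation angle, the function names, and the box convention (x1, y1, x2, y2, with the lower-right corner exclusive) are assumptions; the paper does not specify the angles or offsets that were used.

```python
def rotate90_cw(img):
    """Rotate an image (list of rows) by 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

def rotate90_cw_box(box, img_height):
    """Map a box (x1, y1, x2, y2) into the coordinates of the rotated image."""
    x1, y1, x2, y2 = box
    return (img_height - y2, x1, img_height - y1, x2)

def translate_box(box, dx, dy):
    """Shift a box by (dx, dy) pixels, matching an image translation."""
    x1, y1, x2, y2 = box
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)

def scale_box(box, factor):
    """Scale a box about the image origin, matching a uniform image resize."""
    return tuple(v * factor for v in box)

# A 2 x 3 image and a box that covers only the pixel with value 3.
image = [[1, 2, 3],
         [4, 5, 6]]
box = (2, 0, 3, 1)
rotated = rotate90_cw(image)
rotated_box = rotate90_cw_box(box, img_height=len(image))
```

After the rotation, the transformed box still encloses the pixel with value 3, which is a quick sanity check that the labels follow the image.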

4.2. The Hardware and Training Framework

The training platform used in this study is a high-performance computer running the Windows 10 operating system specifically configured to support complex computational tasks. In terms of its hardware, the computer is equipped with an AMD EPYC 9654 central processing unit (CPU) and an NVIDIA GeForce RTX 4090 graphics processing unit (GPU) (manufacturer or laboratory: NVIDIA Corporation, Santa Clara, CA, USA) with 24 GB of memory, ensuring efficient performance in handling large-scale data and complex computations. For the software environment, PyTorch 2.3.0 (manufacturer or laboratory: Meta AI (formerly Facebook AI Research), Menlo Park, CA, USA) was selected as the deep learning framework, combined with CUDA 12.1 to fully leverage the computational power of the NVIDIA GPU, thereby accelerating the model training process. The entire experiment was coded in Python 3.12.0, ensuring both the stability and compatibility of the code.

5. Experimental Results and Analysis

5.1. Model Training and Parameter Configuration

In this experiment, a Faster R-CNN model for ore recognition was developed using the PyTorch deep learning framework and the Python programming language. The model was optimized using the Adam optimization algorithm with a step-wise learning rate decay strategy. The initial learning rate was set to 0.001, with a step size of 10 epochs and a decay rate (gamma) of 0.1. This means that after the 10th epoch, the learning rate decreases to 0.0001, and after the 20th epoch, it further decreases to 0.00001, continuing this pattern. The entire training process was accelerated on the GPU, with a total of 9450 iterations.
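The learning rate schedule described above follows a simple closed form, which the sketch below evaluates for zero-based epoch indices. The function name `step_lr` is hypothetical. In PyTorch, the same behavior is normally obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)`, although the paper does not state which scheduler class was used.

```python
def step_lr(epoch, base_lr=0.001, step_size=10, gamma=0.1):
    """Learning rate for a zero-based epoch under step decay."""
    return base_lr * gamma ** (epoch // step_size)

# Learning rates at the start of each decay stage.
schedule = {epoch: step_lr(epoch) for epoch in (0, 9, 10, 19, 20)}
```

Epochs 0–9 use 0.001, epochs 10–19 use 0.0001, and epochs 20–29 use 0.00001, matching the values quoted in the text.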

The training process of the loss function in the optimized model was visualized using the matplotlib plotting tool, as shown in Figure 12.

The curve reflects the total loss, which is the sum of all the loss components. Initially, the total loss is relatively high, at approximately 2.0, but it quickly decreases. After 2000 iterations, the total loss gradually stabilizes and remains at a lower level (below approximately 0.5), despite some fluctuations. This indicates that the model is converging, with the loss steadily decreasing. The curves of the various loss components of the total loss are shown in Figure 13.

The RPN regression loss curve reflects the loss of bounding box regression in the Region Proposal Network (RPN). Initially, the loss is relatively high, around 0.5, but it gradually decreases. After 2000 iterations, the loss stabilizes and remains at a low level (approximately below 0.05), despite some fluctuations. This indicates that the accuracy of the RPN’s bounding box positioning is continuously improving. The RPN classification loss curve reflects the loss of objectness in the RPN. The loss is initially high, around 0.7, but it quickly decreases. After 2000 iterations, the loss stabilizes and remains at a low level (close to 0), indicating that the model quickly learns to identify the target areas in the images. The detection network classification loss curve reflects the classification loss. Initially, the loss is relatively high, around 1.4, but it quickly decreases. After 2000 iterations, the loss stabilizes and remains at a low level (approximately below 0.1), despite some fluctuations. This indicates that the model is gradually learning and optimizing its ability to classify targets. The detection network regression loss curve reflects the loss of bounding box regression in the final detection network. Initially, the loss is relatively high, around 0.25, but it gradually decreases and stabilizes. After 2000 iterations, the loss remains at a low level (approximately below 0.05), despite some fluctuations. This indicates that the model is continuously optimizing the positions of the bounding boxes.

Overall, Faster R-CNN shows a gradual decrease and stabilization in all the loss components during the training process, indicating that the model is continuously learning and optimizing. After 2000 iterations, all the loss components stabilize, suggesting that the model has reached a stable state and performs reliably. Despite some fluctuations, the overall trend in the losses is downward, indicating that the model maintains good optimization effects throughout the training process.

5.2. Subjective Evaluation Analysis

In the subjective evaluation experiments, this study provides a detailed comparison between the performance of the original Faster R-CNN model and the improved Faster R-CNN model in the task of rock detection. To validate the detection accuracy of these two models, a series of experiments were designed and analyzed from multiple perspectives. The experimental design is illustrated in Figure 14.

The images depict six different types of rock detection results, specifically for marble, granite, sandstone, limestone, schist, and basalt. Each set of images consists of three rows: the first row presents the original rock images, which serve as baseline references for comparison and analysis; the second row displays the detection results from the original Faster R-CNN model; and the third row presents the detection outcomes from the improved Faster R-CNN model. To further assess the robustness of the models under different environmental conditions, the second through fifth columns in each set of images display detection results under simulated mining noise conditions, including dust coverage, low lighting, lens smudges, and motion blur. In the noise-free original images, both the original and improved Faster R-CNN models generally detect the rock contours and positions with a reasonable degree of accuracy, particularly for rocks with distinct boundary features, such as marble, granite, and limestone. However, when the rock surface textures are complex or the rock colors are similar to the background (as seen with schist and basalt), the original model may exhibit some degree of missed or incorrect detections, while the improved model demonstrates greater adaptability, accurately identifying the rock boundaries. Under simulated mining noise conditions, such as dust coverage, low lighting, lens smudges, and motion blur, the detection performance of the original Faster R-CNN model significantly deteriorates. Particularly in scenarios involving motion blur and lens smudges, the original model is prone to environmental interference, leading to shifted or missed detection boxes. In contrast, the improved Faster R-CNN model maintains greater stability in these complex noisy environments, effectively filtering out the noise and accurately locating the rocks. 
This enhancement in robustness can likely be attributed to the model’s improved feature extraction and fusion techniques, as well as its stronger spatial alignment capabilities.

In the experiment on single-class rock (marble) detection (as shown in Figure 15), the original model demonstrated significant shortcomings when dealing with noise interference. On the noise-free dataset, the original model not only made classification errors but also failed to fully cover the target object with its generated bounding boxes, leaving the lower edge of the object outside the box. When dust noise was introduced, the performance of the original model further deteriorated, with continued classification errors, and the bounding boxes became smaller, shifted, or failed to completely enclose the target object, making it difficult to capture the full extent of the rock. These issues were particularly pronounced under low-light conditions and motion blur. Although the classification was correct in low-light conditions, the model detected only half of the target object, and under motion blur, parts of the bounding box still failed to capture the target. The model’s over-reliance on edge features may cause a lack of robustness in complex environments, especially under lens smudge interference, where the detection accuracy of the original model significantly dropped, leading to noticeable instances of missed detections. This indicates substantial limitations of the original model in handling such complex disturbances. In contrast, the improved Faster R-CNN model demonstrated significant advantages under the same conditions. Whether in low-light environments or under dust and lens smudge interference, the improved model was able to robustly identify the boundaries of the marble. The generated detection boxes not only closely aligned with the actual contours of the rock but also maintained high accuracy in complex backgrounds.

To comprehensively compare the performance of the original Faster R-CNN model and the improved model under these challenging conditions, this paper conducts a detailed evaluation of the detection performance on multi-class rock images, particularly focusing on scenes with complex backgrounds and environmental noise interference. The multi-class rock images tested often exhibit close contact between rocks, which increases the difficulty of detection. A randomly selected set of multi-class rock image prediction results is shown in Figure 16. The order of the prediction results is the same as that of the single-class rock images, with a comparison between the original model and the improved model on the original test set and the test set with simulated mining noise. To further evaluate the robustness of the models in different environments, the second to fifth columns of each set of images display the detection results under simulated mining noise conditions (including dust coverage, low illumination, lens smudging, and motion blur). The comparative analysis reveals that the detection performance of the original Faster R-CNN model significantly declines when faced with these complex environments. For instance, in images with low illumination and lens smudging, the original model’s bounding boxes exhibit noticeable shifts and incompleteness, including failure to detect some objects correctly. Additionally, dust coverage and motion blur have a substantial impact on the original model, leading to misaligned and overlapping detection boxes and a significant decrease in category accuracy. In contrast, the improved Faster R-CNN model demonstrates higher robustness and detection accuracy under the same conditions. The improved model not only accurately locates the boundaries of each rock but also effectively handles various types of noise interference, resulting in detection boxes that closely fit the target objects. 
The improved model’s performance is particularly notable under dust coverage and motion blur conditions, where the bounding boxes still accurately surround the target rocks without significant misalignment or missed detections. Especially under low illumination and lens smudging interference, the improved model maintains a high detection accuracy, indicating a significant enhancement in its ability to handle complex environments.

5.3. Objective Evaluation Analysis

5.3.1. Performance Metrics

This study employs a variety of performance metrics to comprehensively evaluate the detection performance of the model, including accuracy (AC), mean average precision (mAP), recall, and the F1 score. Specifically, mAP@0.5 and mAP@0.75 represent the mean average precision when the Intersection over Union (IoU) threshold is set to 0.5 and 0.75, respectively. These metrics each assess different aspects of the model’s detection capabilities, providing a well-rounded evaluation of its effectiveness.
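The role of the IoU threshold can be made concrete with a small sketch that counts true positives, false positives, and false negatives for a single class and derives precision, recall, and the F1 score from them. The greedy highest-score-first matching rule, the function names, and the example boxes are assumptions for illustration. A full mAP computation additionally averages precision over recall levels and over classes, which is omitted here.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def count_matches(detections, ground_truths, iou_thresh):
    """Greedily match (box, score) detections to ground truths; return (TP, FP, FN)."""
    unmatched = list(ground_truths)
    tp = fp = 0
    for box, _score in sorted(detections, key=lambda d: d[1], reverse=True):
        best = max(unmatched, key=lambda g: iou(box, g), default=None)
        if best is not None and iou(box, best) >= iou_thresh:
            unmatched.remove(best)  # each ground truth can be matched only once
            tp += 1
        else:
            fp += 1
    return tp, fp, len(unmatched)

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 score from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# One ground-truth rock and one detection that overlaps it with IoU = 0.6.
truths = [(0, 0, 10, 10)]
detections = [((0, 0, 10, 6), 0.9)]
at_050 = count_matches(detections, truths, iou_thresh=0.5)
at_075 = count_matches(detections, truths, iou_thresh=0.75)
```

The same detection counts as a true positive at the 0.5 threshold but as a false positive (plus a missed object) at 0.75, which is why scores at the stricter threshold reward tighter localization.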

5.3.2. Comparative Experiment I

To comprehensively evaluate the feature extraction capabilities of the Res2Net network, this study conducts comparative experiments using Faster R-CNN with different backbone networks. Specifically, we compare Res2Net with other mainstream backbone networks, including ResNet and ResNeXt. Through these comparative experiments, we are able to analyze the performance differences in feature extraction among the various backbone networks in detail. The performance metrics of Faster R-CNN with different backbone networks are shown in Table 2.

As shown in the table, the recognition performance of Faster R-CNN varies with different backbone networks. ResNet-50, as the backbone network, exhibits stable performance and provides high detection accuracy, but its performance is somewhat lacking at higher IoU thresholds (e.g., 0.75). ResNeXt-50 enhances the feature extraction capabilities through its multi-path structure, resulting in a significant performance improvement over ResNet-50, particularly in mAP and [email protected]. Res2Net-50, by introducing a multi-scale feature extraction mechanism, significantly enhances the model’s feature representation ability, improving the performance metrics across the board, and performs exceptionally well in complex ore detection tasks. The Faster R-CNN model using Res2Net-50 as the backbone network demonstrates superior performance, surpassing ResNet-50 and ResNeXt-50 in all the performance metrics, indicating stronger feature extraction and representation capabilities. This suggests that Res2Net-50 can provide higher precision in handling complex mining environments and multi-scale object detection tasks.

5.3.3. Comparative Experiment II

In this study, we conducted ablation experiments to thoroughly investigate the impact of various modules on the model performance, including the Res2Net backbone network, the FPN module, the improved FPN module, and the ECA module. To comprehensively assess the effects of these modules, we tested these five models on both the original test set and the simulated mining noise test set, observing the performance improvements contributed by each module. In these experiments, we first trained a baseline model using the standard configuration of Faster R-CNN without a backbone network, with inputs directly fed into the RPN. Subsequently, we incrementally added the Res2Net backbone network, the FPN module, the improved FPN module, the ECA module, ROI Align, and Soft-NMS, forming different combinations of model configurations. The performance of these configurations was then evaluated in both testing environments. The results of the ablation experiments are shown in Table 3 and Table 4.

From the data in Table 3 and Table 4, it can be observed that the addition of various modules significantly enhances the detection performance of the model. In the original test set, the base Faster R-CNN model achieved a mAP of 0.451. After adding the Res2Net backbone, the mAP increased to 0.562, and with the further addition of the FPN module, the mAP reached 0.616. With the incorporation of the improved FPN module and the ECA module, the performance continued to improve, ultimately achieving a mAP of 0.723 after adding ROI Align and Soft-NMS. In the simulated mining noise test set, the base model’s mAP was 0.396. With the addition of the Res2Net backbone, it increased to 0.473 and further to 0.525 with the inclusion of the FPN module. The improved FPN module and the ECA module further enhanced the performance, and with the addition of ROI Align and Soft-NMS, the mAP reached 0.670. In summary, by gradually integrating different modules, this study significantly improved the detection performance of the Faster R-CNN model, especially in noisy environments. The combined application of the Res2Net backbone, the improved FPN module, the ECA module, ROI Align, and Soft-NMS resulted in substantial performance improvements in both test environments. This demonstrates that the improved model has better stability and stronger robustness.

5.3.4. Comparative Experiment III

In this study, we conducted experiments by individually adding various improvements to the Faster R-CNN model to evaluate the impact of each modification on the model’s complexity and performance. The specific improvements included the introduction of ROI Align, Soft-NMS, the use of Res2Net-50 as the backbone network, enhancements to the FPN module, the addition of the ECA module, and a final comprehensive improvement to the model. The table below shows the impact of these modifications on the model performance and complexity, with metrics including accuracy (mAP), the number of parameters (Param, in millions, abbreviated as M), frame rate (FPS, in frames per second), and the computational cost (FLOPs, in billions, abbreviated as B). The experiments were conducted using a simulated noisy dataset, and the detailed results are presented in Table 5.

The table data indicate that the detection performance (mAP) of Faster R-CNN improves significantly with the successive model enhancements. The baseline Faster R-CNN (using ROI pooling) achieves a mAP of 53.31%. Replacing ROI pooling with ROI Align increases the mAP to 54.66%, highlighting the advantage of ROI Align in feature alignment. The introduction of Soft-NMS further boosts the mAP to 54.73%. Incorporating Res2Net-50 leads to a significant increase in mAP to 57.26%, demonstrating that Res2Net’s multi-scale feature representation effectively enhances the model’s performance. Combining the improved FPN with Res2Net-50 results in a mAP of 56.32%, showing a slight decrease but still outperforming the initial enhancements. Introducing the ECA mechanism in conjunction with ROI Align increases the mAP to 57.14%, comparable to (and marginally below) the 57.26% obtained with Res2Net-50 alone, further optimizing the selectivity of the feature channels. Finally, the model integrating Res2Net-50, the improved FPN, ECA, and Soft-NMS achieves the highest mAP of 62.40%, indicating that this combination of improvements significantly enhances detection performance. Regarding the parameter count and computational cost, the baseline Faster R-CNN models (with both ROI pooling and ROI Align) have parameter counts and computational costs of 39.99 M and approximately 192 B, respectively. As the model is progressively improved, the number of parameters increases from 39.99 M to 45.53 M, and the computational cost rises from 191.17 B to 198.30 B. Although the complexity increases, the performance gains confirm the effectiveness of these improvements. In terms of the frame rate (FPS), the addition of multiple enhancement modules results in a slight decrease in the FPS from 20.03 to 19.62, indicating a minimal overall impact. Considering the substantial performance improvements, this trade-off in speed is acceptable. 
In summary, with successive enhancements (including the use of Res2Net-50, the improved FPN, ECA, and Soft-NMS), the mAP rises overall, reaching a peak of 62.40%, which is significantly superior to the baseline version. While the number of parameters and computational cost increase with these improvements, they remain within a manageable range, with the parameters between 40 and 45 M and the computational cost between 191 and 198 B. The fully enhanced model (Faster R-CNN + Res2Net-50 + Improved FPN + ECA + Soft-NMS with ROI Align) achieves a notable improvement in accuracy (mAP). Despite the increase in computational complexity, the FPS remains within the range of 19–20, demonstrating that the improved version substantially enhances the detection accuracy while maintaining good real-time performance.
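The trade-off between accuracy and complexity can be quantified directly from the endpoint values quoted in this subsection (baseline with ROI pooling versus the fully enhanced model). The short calculation below is only a worked check of that arithmetic; the variable and function names are illustrative.

```python
def relative_change(before, after):
    """Signed change from before to after, as a percentage of before."""
    return (after - before) / before * 100.0

# Endpoint values quoted in the text: baseline vs. fully enhanced model.
baseline = {"mAP": 53.31, "params_M": 39.99, "flops_B": 191.17, "fps": 20.03}
improved = {"mAP": 62.40, "params_M": 45.53, "flops_B": 198.30, "fps": 19.62}

map_gain_points = improved["mAP"] - baseline["mAP"]
changes = {key: relative_change(baseline[key], improved[key]) for key in baseline}

for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
print(f"mAP gain: {map_gain_points:.2f} percentage points")
```

Under these figures, the mAP rises by 9.09 percentage points, while the parameter count grows by roughly 14%, the computational cost by roughly 4%, and the frame rate drops by about 2%.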

6. Conclusions

The improved Faster R-CNN model demonstrated outstanding performance in complex mining environments, significantly enhancing the accuracy and efficiency of ore recognition through a series of optimizations. Specific improvements include replacing the original ResNet with Res2Net, which greatly enhanced the feature extraction capability, enabling the model to capture multi-scale features better. Res2Net-50, combined with an improved Feature Pyramid Network (FPN) as the backbone, strengthened the multi-scale feature fusion, effectively improving the accuracy of small object detection. Additionally, the introduction of ROI Align to replace traditional ROI pooling resolved spatial misalignment issues, thereby enhancing the detection accuracy. The inclusion of the efficient channel attention (ECA) module optimized the feature selection for high-resolution semantic information, significantly enhancing the model’s robustness in challenging environments such as low light and high dust conditions. Meanwhile, replacing the original NMS with Soft-NMS allowed for the retention of more potential targets through a confidence decay mechanism, further improving the model’s detection accuracy and robustness.

The experimental results indicate that the improved model significantly outperformed the original model across multiple performance metrics. For instance, Faster R-CNN + Res2Net-50 achieved a mAP of 0.562 on the original test set, up from 0.451. With further integration of the improved FPN as the backbone, the mAP increased to 0.616. Incorporating ROI Align raised the mAP to 0.710. To further enhance its robustness, especially under low light and dusty conditions, the inclusion of the ECA module increased the mAP to 0.696. Finally, replacing the original NMS with Soft-NMS resulted in a mAP of 0.723, an F1 score of 0.688, and recall reaching 0.793, demonstrating the model’s exceptional performance in complex environments.

The improved model also performed well on a simulated mining noise test set. For example, the combination of Faster R-CNN + Res2Net-50 + Improved FPN + ECA + ROI Align + Soft-NMS achieved a mAP of 0.640, significantly higher than the unimproved model. The F1 score and recall reached 0.735 and 0.761, respectively, validating the effectiveness of each module in complex environments.

Overall, these improvements showcase the great potential of deep learning in lithology recognition, providing strong technical support for achieving intelligent and automated lithology recognition. However, this study also has some limitations. While the model performed excellently in experimental settings, practical deployment may face challenges such as limited hardware resources and environmental interference. Moreover, the model needs to integrate with existing mining control and monitoring systems, which may involve incompatibilities in the data interfaces and communication protocols, as well as stringent requirements for system security and stability. The main challenges include interface and protocol compatibility, system security and stability, and scalability.

Future research could address these challenges by developing standardized data interfaces and APIs and by employing a modular microservices architecture so that the model integrates cleanly with existing systems. Security measures such as authentication, data encryption, and access control should be implemented during integration to protect system security and data privacy. Detailed simulations and tests should be conducted before formal deployment to verify stability and compatibility under various working conditions. To further enhance the model’s generalization ability, more diverse data augmentation strategies and semi-supervised learning methods should be explored. Because software-only optimization has limits, future efforts will also use industrial-grade cameras and computing equipment that are dust-proof, shock-resistant, and high-temperature-tolerant, combining hardware and software to maximize system performance and stability. Future research will continue to optimize the model’s environmental adaptability and computational efficiency and to explore its cross-domain applications, promoting the practical use of lithology recognition technology in more complex scenarios.

Author Contributions

Conceptualization, P.F.; methodology, J.W.; validation, P.F. and J.W.; data curation, J.W.; writing—original draft preparation, P.F.; writing—review and editing, P.F. and J.W.; supervision, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by Shenyang University of Technology.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Dong, X.; Le, B.T.; Ha, T.T.L. Iron ore identification method using reflectance spectrometer and a deep neural network framework. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2021, 248, 119168.
2. Yu, J.; Xu, R.; Zhang, J.; Zheng, A. A review on reduction technology of air pollutant in current China’s iron and steel industry. J. Clean. Prod. 2023, 414, 137659.
3. Suthaharan, S. Decision tree learning. In Machine Learning Models and Algorithms for Big Data Classification: Thinking with Examples for Effective Learning; Springer: New York, NY, USA, 2016; pp. 237–269.
4. Song, Y.-Y.; Ying, L.U. Decision tree methods: Applications for classification and prediction. Shanghai Arch. Psychiatry 2015, 27, 130–135.
5. Webb, G.I.; Keogh, E.; Miikkulainen, R. Naïve Bayes. Encycl. Mach. Learn. 2010, 15, 713–714.
6. Zhang, M.-L.; Zhou, Z.-H. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognit. 2007, 40, 2038–2048.
7. Chandra, M.A.; Bedi, S.S. Survey on SVM and their application in image classification. Int. J. Inf. Technol. 2021, 13, 1–11.
8. Zhang, Z.; Liu, Y.; Hu, Q.; Zhang, Z.; Liu, Y. Competitive voting-based multi-class prediction for ore selection. In Proceedings of the 2020 IEEE 16th International Conference on Automation Science and Engineering (CASE), Hong Kong, China, 20–21 August 2020.
9. Zhou, K.; Zhang, J.; Ren, Y.; Huang, Z.; Zhao, L. A gradient boosting decision tree algorithm combining synthetic minority oversampling technique for lithology identification. Geophysics 2020, 85, WA147–WA158.
10. Xie, Y.; Jin, L.; Zhu, C.; Wu, S. A semi-supervised coarse-to-fine approach with bayesian optimization for lithology identification. Earth Sci. Inform. 2023, 16, 2285–2305.
11. Ren, Q.; Zhang, D.; Zhao, X.; Yan, L.; Rui, J. A novel hybrid method of lithology identification based on k-means++ algorithm and fuzzy decision tree. J. Pet. Sci. Eng. 2022, 208, 109681.
12. Li, Z.; Kang, Y.; Feng, D.; Wang, X.M.; Lv, W.; Chang, J.; Zheng, W.X. Semi-supervised learning for lithology identification using Laplacian support vector machine. J. Pet. Sci. Eng. 2020, 195, 107510.
13. Liu, D.; Wang, A.; Wu, Y. Handwritten letter recognition using LetNET. In Proceedings of the 2nd International Conference on Artificial Intelligence, Automation, and High-Performance Computing (AIAHPC 2022), Zhuhai, China, 25–27 February 2022; SPIE: Bellingham, WA, USA, 2022; Volume 12348, pp. 264–269.
14. Sengupta, A.; Ye, Y.; Wang, R.; Liu, C.; Roy, K. Going deeper in spiking neural networks: VGG and residual architectures. Front. Neurosci. 2019, 13, 95.
15. Al-Qizwini, M.; Barjasteh, I.; Al-Qassab, H.; Radha, H. Deep learning algorithm for autonomous driving using GoogLeNet. In Proceedings of the 2017 IEEE Intelligent Vehicles Symposium (IV), Los Angeles, CA, USA, 11–14 June 2017.
16. Sasha, T.; Almeida, D.; Lyman, K. Resnet in Resnet: Generalizing residual architectures. arXiv 2016, arXiv:1603.08029.
17. Zhu, Y.; Newsam, S. DenseNet for dense flow. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017.
18. Önal, M.K.; Avci, E.; Özyurt, F.; Orhan, A. Classification of minerals using machine learning methods. In Proceedings of the 2020 28th Signal Processing and Communications Applications Conference (SIU), Gaziantep, Turkey, 5–7 October 2020.
19. Xu, Z.; Ma, W.; Lin, P.; Hua, Y. Deep learning of rock microscopic images for intelligent lithology identification: Neural network comparison and selection. J. Rock Mech. Geotech. Eng. 2022, 14, 1140–1152.
20. Liu, Y.; Zhang, Z.; Liu, X.; Wang, L.; Xia, X. Ore image classification based on small deep learning model: Evaluation and optimization of model depth, model structure and data size. Miner. Eng. 2021, 172, 107020.
21. Dai, J.; Li, Y.; He, K.; Sun, J. R-fcn: Object detection via region-based fully convolutional networks. In Proceedings of the Advances in Neural Information Processing Systems 29, Barcelona, Spain, 5–10 December 2016.
22. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional Nets, Atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
23. Siddique, N.; Paheding, S.; Elkin, C.P.; Devabhaktuni, V. U-net and its variants for medical image segmentation: A review of theory and applications. IEEE Access 2021, 9, 82031–82057.
24. Nie, X.; Zhang, C.; Cao, Q. Image segmentation method on quartz particle-size detection by deep learning networks. Minerals 2022, 12, 1479.
25. Liu, H.; You, K. Research on image multi-feature extraction of ore belt and real-time monitoring of the tabling by semantic segmentation of DeepLab V3+. In Proceedings of the International Conference on Artificial Intelligence and Security, Quinghai, China, 15–20 July 2022; Springer International Publishing: Cham, Switzerland, 2022.
26. Duan, J.; Liu, X.; Wu, X.; Mao, C. Detection and segmentation of iron ore green pellets in images using lightweight U-net deep learning network. Neural Comput. Appl. 2020, 32, 5775–5790.
27. Luo, X.; Liu, S.; Tang, W.; Wang, X. Research on identification and location of blocked ore at ore bin inlet based on Mask RCNN. Nonferrous Met. Sci. Eng. 2022, 13, 101–107.
28. Zhou, X.; Koltun, V.; Krähenbühl, P. Probabilistic two-stage detection. arXiv 2021, arXiv:2103.07461.
29. Wang, T. Ore Detection Method Based on YOLOv4. In 3D Imaging—Multidimensional Signal Processing and Deep Learning: 3D Images, Graphics and Information Technologies; Springer Nature: Singapore, 2022; Volume 1, pp. 245–257.
30. Hou, Z.; Wei, J.; Shen, J.; Liu, X.; Zhao, W. Intelligent lithology identification methods for rock images based on object detection. Nat. Resour. Res. 2023, 32, 2965–2980.
31. Liu, X.; Wang, H.; Jing, H.; Shao, A.; Wang, L. Research on intelligent identification of rock types based on faster R-CNN method. IEEE Access 2020, 8, 21804–21812.
32. Xu, Z.; Ma, W.; Lin, P.; Shi, H.; Pan, D.; Liu, T. Deep learning of rock images for intelligent lithology identification. Comput. Geosci. 2021, 154, 104799.
33. Pham, C.; Zhuang, L.; Yeom, S.; Shin, H.-S. Automatic fracture detection in CT scan images of rocks using modified faster R-CNN deep-learning algorithm with rotated bounding box. Tunn. Undergr. Space 2021, 31, 374–384.
34. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
35. Koonce, B. ResNet 50. In Convolutional Neural Networks with Swift for Tensorflow: Image Recognition and Dataset Categorization; Apress: Berkeley, CA, USA, 2021; pp. 63–72.
36. Gao, S.H.; Cheng, M.M.; Zhao, K.; Zhang, X.Y.; Yang, M.H.; Torr, P. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662.
37. Li, Y.; Zhou, S.; Chen, H. Attention-based fusion factor in FPN for object detection. Appl. Intell. 2022, 52, 15547–15556.
38. Gong, T.; Chen, K.; Wang, X.; Chu, Q.; Zhu, F.; Lin, D.; Yu, N.; Feng, H. Temporal ROI Align for Video Object Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; AAAI Press: Palo Alto, CA, USA, 2021; Volume 35, pp. 1442–1450.
39. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 11534–11542.
40. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS–Improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5561–5569.

Figure 1. Faster R-CNN model structure.

Figure 2. Principle diagram of RPN anchor generation.

Figure 3. Principle diagram of ROI pooling.

Figure 4. Improved Faster R-CNN model structure.

Figure 5. ResNet (left) and Res2Net (right).

Figure 6. Diagram of the improved FPN structure.

Figure 7. Principle diagram of ROI Align (bilinear interpolation).

Figure 8. ECA structure diagram.

Figure 9. Some of the original image data.

Figure 10. Some of the simulated noisy images of mining environments.

Figure 11. Data processing workflow framework.

Figure 12. Total loss function curve.

Figure 13. Loss curves, from left to right and top to bottom: the RPN classification loss, the RPN regression loss, the detection network classification loss, and the detection network regression loss.

Figure 14. Detection results for six types of rocks, from left to right and top to bottom: marble, granite, sandstone, limestone, schist, and basalt.

Figure 15. The marble rock detection image (light blue circles indicate anomalous points).

Figure 16. Predictions on multi-class rock images in more complex scenarios.

Table 1. Previous research by other scholars based on Faster R-CNN.

Scholars	Dataset Types	Technical Methods	Main Results
Xiaobo Liu et al. [31]	Private dataset (captured)	Faster R-CNN + VGG16	Improved the recognition accuracy and processing speed.
Zhenhao Xu et al. [32]	Private dataset (captured and obtained from the web)	Faster R-CNN + ResNet-50	In comparison with YOLOv4, the Faster R-CNN algorithm demonstrated higher accuracy and stability.
Pham Chuyen et al. [33]	Private dataset (captured)	Faster R-CNN + ResNet-101	The mean average precision (mAP) reached as high as 0.89.

Table 2. Detection performance of Faster R-CNN with different backbone networks.

Algorithm	Backbone	AC	mAP	mAP@0.5	mAP@0.75
Faster R-CNN	ResNet-50	0.703	0.503	0.592	0.515
Faster R-CNN	ResNeXt-50	0.745	0.527	0.614	0.532
Faster R-CNN	Res2Net-50	0.797	0.562	0.644	0.551

Table 3. Performance metrics on the original test set.

Algorithm	mAP	mAP@0.5	mAP@0.75	F1	Recall
Faster R-CNN (without backbone)	0.451	0.560	0.403	0.645	0.670
Faster R-CNN + Res2Net-50	0.562	0.644	0.551	0.722	0.728
Faster R-CNN + Res2Net-50 + FPN	0.616	0.687	0.602	0.735	0.758
Faster R-CNN + Res2Net-50 + Improved FPN	0.659	0.725	0.631	0.755	0.772
Faster R-CNN + Res2Net-50 + Improved FPN + ECA	0.696	0.760	0.667	0.763	0.779
Faster R-CNN + Res2Net-50 + Improved FPN + ECA + ROI Align	0.710	0.772	0.671	0.779	0.786
Faster R-CNN + Res2Net-50 + Improved FPN + ECA + ROI Align + Soft-NMS	0.723	0.785	0.688	0.793	0.822

Table 4. Performance metrics on the simulated mine noise test set.

Algorithm	mAP	mAP@0.5	mAP@0.75	F1	Recall
Faster R-CNN (without backbone)	0.396	0.430	0.373	0.508	0.525
Faster R-CNN + Res2Net-50	0.473	0.525	0.443	0.593	0.610
Faster R-CNN + Res2Net-50 + FPN	0.517	0.562	0.480	0.636	0.652
Faster R-CNN + Res2Net-50 + Improved FPN	0.564	0.615	0.521	0.678	0.690
Faster R-CNN + Res2Net-50 + Improved FPN + ECA	0.603	0.652	0.563	0.708	0.729
Faster R-CNN + Res2Net-50 + Improved FPN + ECA + ROI Align	0.621	0.676	0.574	0.716	0.742
Faster R-CNN + Res2Net-50 + Improved FPN + ECA + ROI Align + Soft-NMS	0.640	0.686	0.593	0.735	0.761

Table 5. Performance and complexity of various improved models.

Algorithm	mAP/%	Params/M	FLOPs/G	FPS/s⁻¹
Faster R-CNN (ROI pooling)	53.31	39.99	191.17	20.03
Faster R-CNN (ROI Align)	54.66	39.99	192.20	20.25
Faster R-CNN + Soft-NMS (ROI Align)	54.73	39.99	192.54	20.17
Faster R-CNN + Res2Net-50 (ROI Align)	57.26	42.32	195.58	19.83
Faster R-CNN + Improved FPN (ROI Align)	56.32	41.56	195.49	19.79
Faster R-CNN + ECA (ROI Align)	57.14	40.01	193.33	20.94
Faster R-CNN + Res2Net-50 + Improved FPN + ECA + Soft-NMS (ROI Align)	62.40	45.53	198.30	19.62

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Fu, P.; Wang, J. Lithology Identification Based on Improved Faster R-CNN. Minerals 2024, 14, 954. https://doi.org/10.3390/min14090954
