1. Introduction
With the rapid development of the manufacturing industry, demand for advanced machining technologies has increased. In the machining process, wear on cutting tools leads to a reduction in product quality and a decrease in production efficiency [1,2]. Therefore, tool-condition monitoring (TCM) plays a crucial role in guiding the reasonable use of tools. For companies, reliable TCM has significant production value because it can prevent unplanned downtime and the corresponding economic losses [3,4]. Because the tool wear process is dynamic, nonlinear, and influenced by complex working environments, the accurate prediction of tool wear remains a significant challenge [5].
To address these predictive challenges, this study employed deep learning techniques. Unlike traditional machine learning models, which require pre-defined feature extraction, deep learning models learn features autonomously from complex data, which is crucial for modeling nonlinear and multivariate tool wear processes [6]. Their adaptability and generalization also help them handle real-world data inconsistencies and noise. This study aimed to develop a deep learning-based TCM model to enhance the accuracy and efficiency of cutting tool wear prediction. Convolutional neural networks (CNNs) [7] and recurrent neural networks (RNNs) [8] are the architectures most used in recent studies.
CNNs have demonstrated superior feature extraction capabilities in large-scale image recognition tasks due to their unique convolutional and pooling layers [
9]. For example, Kumar et al. [
10] employed a deep CNN architecture using images of surfaces machined without cutting fluid as inputs. By selecting the right training parameters, they classified cutting tool wear, reaching a model recognition and classification accuracy of 99.9%. Additionally, Lim et al. [
11] conducted a comparison between DNN and CNN networks in the field of tool wear recognition. They found that CNNs are more reliable in using cropped images of machined surface contours to predict the amount of flank wear on tools during turning processes, achieving an accuracy rate of 98.9% and an average test RMSE of 2.0969. Meanwhile, García-Pérez et al. [
12] employed multi-view camera technology, supplemented by data augmentation and class weighting, to manage the number of worn tools assessed and the costs associated with image collection. They considered and tested two CNN architectures, reaching an experimental accuracy as high as 97.8% (with a Matthews correlation coefficient of 0.955), and they were able to detect defects in various blade types. Zhang et al. [
13] obtained the initial dataset through wavelet transformation, followed by the use of a conditional variational autoencoder with CNN to augment the dataset, addressing the issue of data imbalance. This augmented dataset served as the input for a CNN. Subsequently, they described tool wear using a multistage nonlinear Wiener process model. Brili et al. [
14] implemented an infrared camera for process monitoring, capturing the visual and thermal states during the cutting process, and created a dataset with more than 9000 images. Using a CNN, they developed a predictive model for tool wear and tool damage. The model automatically assesses the condition of cutting tools (ranging from no wear to high wear) using thermal imaging data, with a classification accuracy of 99.55%. While CNNs have made advancements in the prediction of tool wear, existing models mainly focus on the spatial correlations of machining signals, often overlooking the inherent temporal associations and dynamic features within the signals. This bias leads to current CNN models’ difficulty in effectively addressing long-term dependency issues, which are crucial for analyzing physical quantities in tool operations as these quantities are information-rich along the temporal dimension.
The core characteristic of RNNs lies in their recurrent hidden layers, which allow the model to consider the temporal order and dependencies between the current input and historical information when processing data [15]. However, RNNs are prone to vanishing and exploding gradients, especially over long sequences or with many hidden layers. To address this issue, researchers have proposed variants such as long short-term memory networks (LSTMs) and gated recurrent units (GRUs). Shah et al. [
16] used sensors to capture acoustic emission and vibration signals, creating scaleograms with Morlet wavelets. They utilized the relative wavelet energy criterion to choose appropriate wavelet functions and employed SinGAN to produce extra scaleograms. Subsequently, they extracted several image quality parameters to build feature vectors, which were fed into a stacked LSTM model, achieving outstanding performance indicators. Li et al. [
17] used radar charts to integrate multi-source signal features and combined them with AdaBoost and Stacked BiLSTM for accurate tool wear prediction. Mahmood et al. [
18] employed the singular spectrum analysis algorithm to denoise and extract features from original force signal data. Utilizing principal component analysis techniques to reduce data dimensionality and one-hot encoding to transform the model’s target variables from text to binary numerical format, they inputted these data into a BLSTM model, which successfully recognized the state of the tools. Marani et al. [
19] proposed a predictive model based on LSTM for predicting tool flank wear in the machining process of steel alloys. They tested the LSTM model using the spindle motor current signals gathered during experiments performed on a lathe. Bilgili et al. [
20] developed a neural network based on LSTM architecture to predict tool flank wear using measured spindle motor current and dynamometer signals. Although RNNs excel in processing time-series data, particularly in capturing temporal correlations and long-term dependencies, they still encounter limitations in the field of tool wear prediction. Existing RNN models typically process sequence data at fixed time steps, a structure that restricts RNNs from naturally adapting to and capturing the multi-scale features present in sequence data. This limitation inevitably leads to a significant loss of valuable information and results in an incomplete representation of features.
Although all of these methods employ variants of CNN or RNN and combine them with other unique processing to obtain good results, they may still not be able to fully learn all the relevant information as they are mainly based on a single network structure, which may limit the performance of the models. Furthermore, the stability of these models in harsh environments remains to be further verified. Therefore, deeper exploration into network structures is crucial.
In recent years, the combination of CNNs and RNNs has been widely explored in various fields, as this combination is able to take advantage of the unique strengths of both networks in feature extraction. Marei et al. [
21] developed a hybrid CNN-LSTM model that incorporates a transfer learning mechanism, using multimodal data from cutting tools. They employed a pre-trained ResNet-18 CNN model to extract features from visual inspection images of the cutting tools. They implemented transfer learning based on maximum mean discrepancy to adapt the trained model specifically for cutting tools. Zhou et al. [
22] used a GRU to capture the temporal dependence in the tool cutting signal and then used CNN to extract multidimensional features, which were mapped to tool wear values by linear regression. Si et al. [
23] proposed BiLPReS, a novel predictive model that utilizes a hybrid architecture integrating LSTM, an encoder actuator, and residual skip connections. Compared with CNNs and RNNs, this model achieves global perception of long-range dependencies and parallel computation. An et al. [
24] first extracted local features using CNN and reduced the dimensionality, then stacked BiLSTM with LSTM for denoising and coding, followed by multiple fully connected layers and regression layers to predict the remaining useful life of the tool. Bazi et al. [
25] decomposed signals into a sub-time-series known as intrinsic mode functions through variational mode decomposition. Using these intrinsic mode functions as inputs, they successfully achieved relatively accurate tool wear predictions by employing a combination of CNN and BiLSTM. While the fusion of CNNs and RNNs in the field of tool cutting wear prediction has shown significant effectiveness, the current mainstream architecture of serial models exhibits clear limitations. Specifically, the architecture where the output of one model serves sequentially as the input for the subsequent model leads to a key issue: errors at each stage may be cumulatively amplified in subsequent stages, thereby affecting the accuracy of the final output. Moreover, this serial dependency nature restricts the model’s parallel processing capabilities, further reducing computational efficiency.
To address the above issues, this paper introduces a novel approach named parallel convolutional and recurrent neural networks with attention-modulated residual learning (ParaCRN-AMResNet). The framework adopts a parallel structure that integrates multi-scale dilated CNN modules with BiGRU modules. In addition, residual blocks with an attention mechanism are introduced to compensate for uncaptured critical information, using standard residual blocks to accelerate convergence and stabilize the computation. Global average pooling (GAP) is employed to identify and retain the most representative local spatial features while also reducing the spatial dimensions of the feature mappings. The selected prominent features are fused through a fully connected layer, outputting the predicted tool wear amount.
The main contributions are as follows:
A new tool wear prediction method is proposed that completes wear prediction through an end-to-end mechanism and outperforms traditional sequential deep learning models, particularly in processing speed.
A parallel architecture is adopted, enabling independent feature capture among CNN modules, residual blocks, and RNN modules, which significantly enhances the model’s computational efficiency and accuracy and reduces error accumulation.
Different sizes of dilated convolution structures and BiGRU structures have been designed to capture feature information across various time dimensions, effectively solving the time-dependency issues found in traditional models.
Effective attention units have been integrated, with SimAM emphasizing and highlighting key features, while ResNeSt compensates for potentially uncaptured critical information, further enhancing prediction accuracy and the model’s noise resistance capability.
The remainder of this paper is structured as follows.
Section 2 thoroughly discusses the detailed structure of the functional modules and overall framework of the proposed deep learning model.
Section 3 details the construction of the experimental rig and delves into an in-depth analysis of the experimental results on tool wear prediction, thereby confirming the efficacy and noise resistance capability of the proposed ParaCRN-AMResNet model.
Section 4 presents some important conclusions of the paper.
2. Proposed Methodological Framework
This section primarily introduces the methodological principles employed by the ParaCRN-AMResNet model. It combines the advantages of dilated CNN and BiGRU models in a parallel structure. The model conducts in-depth spatio-temporal feature extraction through the SimAM module and a series of dilated CNN layers, utilizing ResNeSt for feature refinement and focus. Meanwhile, a Seq2Seq-structured BiGRU processes sequential data, capturing temporal features across different scales. The two streams are then merged, with final predictions being performed through a fully connected layer.
2.1. Dilated CNN
Dilated CNN has been recognized as a significant technological advancement in the field of deep learning. Its advantage lies in systematically enlarging the receptive field of the convolutional kernel without adding any extra parameters [
26,
27]. This adjustment allows for a deeper and broader exploration of the input features, thereby improving the model’s capacity for processing time-series data. Focusing on the architecture of the dilated CNN, it is characterized by its ability to modify the layout of the convolutional kernel to enhance functionality. This design not only enhances the model’s ability to capture long-term dependencies but also maintains computational efficiency.
In terms of technical details, the key difference between a dilated CNN and a traditional CNN is that the elements of the convolutional kernel are spaced at a fixed interval, with the size of this interval defined by the dilation rate. For 1D data, the dilated convolution can be described by the following equation:
$$y[i] = \sum_{j=0}^{M-1} x[i + d \cdot j]\, k[j] \qquad (1)$$
where $y[i]$ is the output feature at position $i$; $x$ is the input feature; $k$ holds the weights of the convolution kernel; $j$ indexes the positions of the convolution kernel; $i$ is the current data point position; $d$ is the dilation rate, defining the interval between weights in the convolution kernel; and $M$ is the length of the convolution kernel. To further elucidate,
Figure 1 provides an intuitive illustration. In this figure,
Figure 1a depicts a standard convolution kernel covering a 5 × 5 feature range;
Figure 1b shows a convolution kernel with a dilation rate of 2, where each pair of adjacent elements has a clear gap, thus expanding its receptive field to 9 × 9;
Figure 1c presents a convolution kernel with a dilation rate of 4, where there are 3 gaps between each pair of adjacent elements, leading to a further increase in the receptive field to 17 × 17.
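As a minimal illustration of the dilated convolution described above, the following pure-Python sketch applies a 1D kernel with dilation rate d and computes how many input positions a single dilated kernel spans. The function names and the no-padding ("valid") boundary handling are illustrative assumptions and are not taken from the proposed model's implementation.

```python
def dilated_conv1d(x, k, d=1):
    """Valid-mode dilated 1D convolution: y[i] = sum_j x[i + d*j] * k[j]."""
    span = d * (len(k) - 1)  # distance between the first and last kernel tap
    return [
        sum(x[i + d * j] * k[j] for j in range(len(k)))
        for i in range(len(x) - span)
    ]


def kernel_extent(kernel_size, dilation):
    """Number of input positions spanned by one dilated kernel along an axis."""
    return dilation * (kernel_size - 1) + 1


signal = [1, 2, 3, 4, 5]
print(dilated_conv1d(signal, [1, 1], d=2))       # sums samples two steps apart
print([kernel_extent(5, d) for d in (1, 2, 4)])  # extents for the Figure 1 kernels
```

With a kernel size of 5, the extents for dilation rates 1, 2, and 4 are 5, 9, and 17, which agrees with the receptive fields quoted for Figure 1 while the number of kernel weights stays fixed.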
2.2. Global Average Pooling
GAP is a variant of conventional pooling, and its general structure can be seen in Figure 2 [28]. It is positioned at the end of the CNN, following the last convolutional layer. Unlike the traditional flatten layer, the GAP layer reduces the number of parameters in the model, mitigates overfitting, and enhances the model's global understanding of spatial features in the input data. For time-series data, GAP1D is commonly used, where the average is calculated as follows:
$$\mathrm{GAP}(F) = \frac{1}{L} \sum_{i=1}^{L} f_i \qquad (2)$$
where $F$ is the feature vector; $L$ is the length of the feature vector; and $f_i$ is the $i$-th feature value in the vector. Specifically, the GAP layer performs a global average pooling operation on the feature maps output by the last convolutional layer, transforming each feature map into a single numerical value. This not only simplifies the subsequent processing steps but also retains essential spatial information within the feature maps. Hence, the GAP layer serves as a key transition point within the model's structure and acts as a bridge between feature extraction and the final prediction.
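The pooling step can be sketched in a few lines of pure Python. The feature values below are made up for illustration, and `gap1d` is a hypothetical helper name; the point is only to show that each channel collapses to one number, so a following dense layer sees far fewer inputs than it would after a flatten layer.

```python
def gap1d(feature_maps):
    """Average each channel of a (channels x length) feature map to one value."""
    return [sum(channel) / len(channel) for channel in feature_maps]


# Two hypothetical channels of length 4 produced by the last convolutional layer.
features = [[0.2, 0.4, 0.6, 0.8], [1.0, 1.0, 1.0, 1.0]]
pooled = gap1d(features)

# Inputs seen by a following dense layer: flatten keeps channels * length values,
# whereas GAP keeps only one value per channel.
flatten_inputs = sum(len(channel) for channel in features)
gap_inputs = len(pooled)
print(pooled, flatten_inputs, gap_inputs)
```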
2.3. Bidirectional Gated Recurrent Unit
GRUs [29], like LSTM networks, were proposed to address the long-term memory and gradient problems that traditional RNNs encounter during backpropagation. These units are designed to process sequential data, particularly in contexts requiring the capture of long-term dependencies. The GRU controls the flow of information through a gating mechanism, with its structure shown in Figure 3. For a sequence $x = (x_1, x_2, x_3, \ldots, x_t)$, where $x_t$ is the input at time step $t$, the update gate $z_t$ determines the extent to which the hidden state from the previous time step $h_{t-1}$ is retained in the current time step:
$$z_t = \sigma\left(W_z [h_{t-1}, x_t] + b_z\right) \qquad (3)$$
The reset gate $r_t$ controls the influence of the previous time step's hidden state $h_{t-1}$ on calculating the candidate hidden state $\tilde{h}_t$ at the current time step:
$$r_t = \sigma\left(W_r [h_{t-1}, x_t] + b_r\right) \qquad (4)$$
The candidate hidden state $\tilde{h}_t$ is calculated based on the reset previous hidden state and the current input, providing a candidate value for the new hidden state:
$$\tilde{h}_t = \tanh\left(W [r_t \odot h_{t-1}, x_t] + b\right) \qquad (5)$$
The final hidden state $h_t$ is determined through interaction with the update gate, which dictates the proportion of the previous hidden state to be retained and the extent of the new candidate hidden state to be incorporated:
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t \qquad (6)$$
where $\sigma$ is the sigmoid activation function, used to control the flow of information; $\tanh$ is the hyperbolic tangent activation function; $\odot$ is the Hadamard product; $W_z$, $W_r$, and $W$ are the weight matrices for the respective gates; and $b_z$, $b_r$, and $b$ are the bias vectors.
BiGRU deploys two independent GRUs at each time point, one processing the forward flow of the sequence and the other handling the backward flow, enabling it to encode both forward and backward information of a sequence simultaneously. This bidirectional structure allows the model to understand the data from two directions, providing a more comprehensive analysis of the sequence. Its structure is shown in
Figure 4. The hidden states of both forward and backward directions can be calculated by the above Equations (3)–(6). The overall hidden state at time point $t$ is given by:
$$h_t = \left[\overrightarrow{h}_t, \overleftarrow{h}_t\right] \qquad (7)$$
where $\overrightarrow{h}_t$ is the forward hidden state at time $t$ and $\overleftarrow{h}_t$ is the hidden state at time $t$ in the reverse direction.
Compared with the network structure of LSTM, GRU only contains reset and update gates, while LSTM has a forget gate, an input gate, and an output gate. This means that GRU has fewer model parameters and a more streamlined network structure under the premise of the same number of hidden units. Based on these advantages, this paper selects BiGRU and designs it as a Seq2Seq structure, which can learn directly from the source sequence to the target sequence without the need for manually designing complex features. Additionally, the model can remember and utilize long-distance dependency information in the input sequence.
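To make the gating recurrences concrete, the sketch below implements a single GRU step and a bidirectional pass in pure Python. It is a toy under stated assumptions: each gate uses one weight matrix applied to the concatenation of the previous hidden state and the input, the update gate weights the previous state (as in the description of the update gate above), and the two directions are merged by concatenation. The helper names (`affine`, `gru_step`, `bigru`, `zero_params`) are hypothetical, and the Seq2Seq arrangement used in the paper is not reproduced here.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, v, b):
    """Compute W @ v + b for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def gru_step(x_t, h_prev, p):
    """One GRU update; p holds Wz, Wr, W (matrices) and bz, br, b (vectors)."""
    hx = h_prev + x_t  # list concatenation [h_{t-1}, x_t]
    z = [sigmoid(a) for a in affine(p["Wz"], hx, p["bz"])]  # update gate
    r = [sigmoid(a) for a in affine(p["Wr"], hx, p["br"])]  # reset gate
    reset_hx = [ri * hi for ri, hi in zip(r, h_prev)] + x_t
    h_cand = [math.tanh(a) for a in affine(p["W"], reset_hx, p["b"])]
    # z keeps the previous state; (1 - z) admits the candidate state.
    return [zi * hp + (1.0 - zi) * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


def gru(seq, p, hidden):
    """Run a GRU over seq (a list of input vectors) and return every hidden state."""
    h, states = [0.0] * hidden, []
    for x_t in seq:
        h = gru_step(x_t, h, p)
        states.append(h)
    return states


def bigru(seq, p_fwd, p_bwd, hidden):
    """Concatenate forward and time-reversed backward hidden states per step."""
    forward = gru(seq, p_fwd, hidden)
    backward = gru(seq[::-1], p_bwd, hidden)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def zero_params(hidden, n_in, cand_bias=0.0):
    """Hypothetical all-zero weights, handy for checking the recurrences by hand."""
    def zeros():
        return [[0.0] * (hidden + n_in) for _ in range(hidden)]
    return {"Wz": zeros(), "Wr": zeros(), "W": zeros(),
            "bz": [0.0] * hidden, "br": [0.0] * hidden, "b": [cand_bias] * hidden}


sequence = [[0.1], [0.2], [0.3]]
states = bigru(sequence, zero_params(2, 1), zero_params(2, 1), hidden=2)
print(len(states), len(states[0]))  # one concatenated state per time step
```

The sketch also makes the parameter argument visible: a GRU step needs three weight matrices, whereas an LSTM step with the same number of hidden units would need four.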
2.4. Residual Network
The cornerstone of ResNet [30] is the residual block, which brings a structural contribution by introducing a direct channel for information flow in deep networks. This design allows the network's original input to be directly transmitted to subsequent layers through shortcut connections, thereby effectively facilitating the backpropagation of gradients. In deep CNNs, the training process is prone to gradient vanishing or exploding as the network depth increases. The introduction of residual blocks ensures that gradients can be transmitted without obstacles, even in very deep networks, significantly enhancing the model's training efficiency and stability. The structural design of the residual block allows each block to directly utilize information from previous layers, not solely relying on the outcomes of the current layer's processing. Specifically, a residual block can be described as follows:
$$H(x) = F(x) + x \qquad (8)$$
where $H(x)$ is the final output mapping; $F(x)$ is the residual mapping; and $x$ is the input carried through the identity (shortcut) mapping.
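The shortcut connection can be illustrated with a toy example in which the residual branch F is an arbitrary function supplied by the caller. The branch functions below are hypothetical stand-ins for convolutional layers; the sketch only shows that when F contributes nothing, the block reduces to the identity, which is why stacking such blocks does not obstruct the flow of information.

```python
def residual_block(x, residual_fn):
    """Return H(x) = F(x) + x, where residual_fn plays the role of F."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must match the length of x for an identity shortcut")
    return [f + xi for f, xi in zip(fx, x)]


def zero_branch(x):
    """A residual branch that contributes nothing (for example, untrained weights)."""
    return [0.0 for _ in x]


def scale_branch(x):
    """A residual branch that adds a small correction to the input."""
    return [0.1 * v for v in x]


x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_branch))   # the input passes through unchanged
print(residual_block(x, scale_branch))  # input plus a learned-style correction
```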
In this paper, ResNeSt as an improved version of the residual block is also used, with its structure being shown in
Figure 5a [
31]. It is used in parallel with a dilated convolutional neural network, a structural arrangement that enables the ResNeSt block to more effectively supplement feature information that the main network might miss and helps to avoid the problem of gradient explosion. Within it, the input features are first divided into multiple cardinals, and each cardinal is further divided into radix groups; the convolution and split-attention operations are performed separately, and then all outputs are concatenated with another layer of convolutional outputs. The core of this process is called split-attention, whose internal structure is shown in
Figure 5b. Given a set of input tensors [
X1,
X2, …,
Xr] with the dimensions [
h ×
w ×
c], a global average pooling operation is applied to each tensor
Xi, which is:
where
i = 1, 2, …,
r; subsequently, each
Pi undergoes a fully connected operation followed by a batch normalization (BN) operation, and is then activated:
then, for each output
Di, three different fully connected operations are applied to generate a set of attention weights:
where
j = 1, 2, 3; then, an r-Softmax operation is used to normalize the weights to obtain
Sij.
Sij is used to weigh the input
Xi, and all results are summed to obtain the final output, as shown in (12) and (13).
The effectiveness of residual networks in various computer vision tasks has been widely demonstrated.
2.5. Attention Mechanisms for Convolutional Part
Attention mechanisms have been shown to help models focus on the task-relevant parts of their input, thereby improving model performance. To avoid introducing excessive computational overhead while still improving performance, this paper adopts the SimAM attention mechanism proposed by Yang et al. [32]. Its purpose is to emphasize important features closely related to the task and suppress redundant features within the convolutional module. In contrast to existing attention modules, which are mainly based on channel and spatial dimensions, the SimAM mechanism focuses on adjusting the weights of the feature maps to enhance feature discriminability and does not add extra parameters to the original network. Its specific structure is shown in Figure 6.
The SimAM attention mechanism is based on neuroscience theory, and its core is the ‘energy function’, which is expressed as:
where
et is the energy;
wt and
bt are the weighting and bias transformations;
t and
xi represent the target neuron and other neurons in a single channel of the input features, respectively;
i is the index in the spatial dimension;
M is the number of neurons in that channel;
λ is a normalization parameter; and
μt and
σ2 are the mean and variance computed for all neurons in the channel excluding
t. This function quickly calculates the energy of each neuron, thereby determining its importance. Because SimAM is designed with reference to the attention mechanism of mammals, it employs a scaling operator to represent the brain’s gain effect
on neuronal responses, expressed as:
where
E groups all minimum energy differences across channels and spatial dimensions, and then a sigmoid function is used to limit excessively large values in
E. ⨀ is the Hadamard product. It is used to emphasize task-relevant important features and suppress redundant features in the convolutional module.
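A minimal sketch of this parameter-free weighting for a single 1D channel is given below. It follows the closed-form shortcut commonly used in public SimAM implementations, which computes the mean and variance once over the whole channel instead of excluding each target neuron in turn; that shortcut, the default value of `lam` (standing in for λ), and the function names are assumptions of this sketch and not details confirmed by the text.

```python
import math


def simam_weights(channel, lam=1e-4):
    """Per-neuron attention weights for one channel (needs at least two values)."""
    n = len(channel) - 1  # number of neurons other than the target
    mean = sum(channel) / len(channel)
    sq_dev = [(v - mean) ** 2 for v in channel]
    var = sum(sq_dev) / n
    # Inverse of the minimal energy: larger for neurons that stand out.
    inv_energy = [d / (4.0 * (var + lam)) + 0.5 for d in sq_dev]
    # The sigmoid limits excessively large values.
    return [1.0 / (1.0 + math.exp(-e)) for e in inv_energy]


def simam(channel, lam=1e-4):
    """Rescale activations element-wise by their attention weights."""
    return [w * v for w, v in zip(simam_weights(channel, lam), channel)]


channel = [0.0, 0.0, 0.0, 10.0]
print(simam_weights(channel))  # the outlying neuron receives the largest weight
print(simam(channel))
```

No trainable parameters appear anywhere in the computation, which is the property that keeps the attention step lightweight.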
2.6. Parallel Modelling Structure
As previously mentioned, CNNs and RNNs each have their unique advantages, but different types of neural networks also have their limitations. A single module may struggle to capture all information, limiting its effectiveness in complex applications. Therefore, hybrid models that combine multiple network architectures are particularly crucial. These models often integrate the advantages of various structures and can even, to some extent, compensate for certain deficiencies.
The structure widely used by current researchers is the sequential stacking of networks; two sequential stacking structures are shown in
Figure 7. Different types of networks are sequentially organized according to their feature extraction capabilities. Although this structure has been proven to be effective to a certain extent, there are still challenges with this sequential structure in series. Due to the inherent sequential dependency, the performance of subsequent networks is largely limited by the feature extraction effectiveness of preceding networks. This effect, known as ‘error accumulation’, can amplify minor errors from one module to the next, significantly reducing overall performance. Furthermore, the computational efficiency of the sequential structure model is inherently constrained by the order of computation.
In response to these challenges, this study proposes a novel parallel network structure as shown in
Figure 8. The structure adopts a multi-input strategy, where the original features are input into the CNN part and the RNN part for independent feature learning, respectively, after undergoing a simple pre-processing. Throughout the supervised model training process, each input stream remains independent, avoiding mutual interference. After a series of operations, all extracted features are eventually concatenated into a 1D vector, preparing the ground for subsequent tool wear analysis.
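The difference between the two arrangements can be sketched with plain functions. In the toy below, `diff` and `running_mean` are hypothetical stand-ins for two feature extractors (they are not the CNN and RNN parts of the proposed model); the sequential version feeds the second stage with the first stage's output, whereas the parallel version gives both branches the same input and concatenates their outputs into one vector.

```python
def serial(x, stages):
    """Sequential stacking: each stage consumes the previous stage's output."""
    for stage in stages:
        x = stage(x)
    return x


def parallel(x, branches):
    """Parallel structure: every branch sees the same input; outputs are concatenated."""
    features = []
    for branch in branches:
        features.extend(branch(x))
    return features


def diff(x):
    """Adjacent differences, a crude local-pattern extractor."""
    return [b - a for a, b in zip(x, x[1:])]


def running_mean(x):
    """Cumulative mean, a crude extractor that depends on the history."""
    total, out = 0.0, []
    for i, v in enumerate(x, start=1):
        total += v
        out.append(total / i)
    return out


signal = [1, 2, 4, 8]
print(serial(signal, [diff, running_mean]))    # second stage only sees diff's output
print(parallel(signal, [diff, running_mean]))  # both branches see the raw signal
```

In the sequential call, anything lost or distorted by the first stage is invisible to the second; in the parallel call, each branch works from the original input, which is the property the proposed structure relies on.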
2.7. Proposed Model Structure
This paper proposes a novel model, ParaCRN-AMResNet, depicted in
Figure 9. The model consists of three main parts: the CNN part, the RNN part, and a fully connected layer. The specific implementation process is as follows:
(1) The model employs wavelet transformation and concatenation to preprocess the signal, ensuring a comprehensive multi-scale feature input.
(2) It uses dilated CNN and ResNeSt blocks in a parallel layout to extract features at diverse scales without cross-interference. The SimAM attention mechanism selectively focuses on crucial features, streamlining the feature set.
(3) A Seq2Seq BiGRU module runs in parallel with the CNN layers. This configuration captures temporal features, with a varying number of units in the hidden layers to address different time scales.
(4) The outputs from the CNN and RNN segments are combined and passed to a fully connected layer to predict tool wear.
The model’s parallel computing architecture allows for independent and effective feature extraction, merging the benefits of dilated CNN and BiGRU. The SimAM attention mechanism enhances focus on task-relevant features, improving the model’s sensitivity and precision. The incorporation of ResNeSt blocks supplements the model, ensuring no critical feature is overlooked. An increasing dilation rate in the dilated CNN captures detailed features, ranging from localized to broader contextual information. Concurrently, the BiGRU module, with its hidden layers gradually decreasing in size, adeptly captures temporal information across varying scales, enhancing the model’s ability to process complex time-series data.
In summary, the ParaCRN-AMResNet model encapsulates a blend of innovative techniques and structures, enhancing its performance in tool wear prediction.
3. Experiment
In this section, the primary focus encompasses the experimental conditions related to tool wear data, the selection of parameters for the ParaCRN-AMResNet model, the assessment of model noise resistance capability, and the ultimate prediction results.
3.1. Experiment Setup
Experiments are conducted using the IEEE PHM Challenge 2010 dataset [
33] to validate the performance of the proposed ParaCRN-AMResNet model. The specific layout of the experimental setup is shown in
Figure 10. The experiment utilized a Röders Tech RFM 760 CNC machine tool, and the selected tool was a three-flute carbide ball-end mill with a cutting length of 108 mm. The workpiece material was stainless steel, and the relevant cutting parameters are listed in
Table 1. In order to measure the cutting forces, a three-directional piezoelectric dynamometer from Kistler was installed between the machine tool and the workpiece. At the same time, three Kistler accelerometers were mounted on the workpiece to measure vibration signals, and an acoustic emission (AE) sensor was used to capture elastic waves generated by stress changes. All of these sensor signals were amplified and collected through a Kistler 5019A multi-channel charge amplifier and DAQ Ni PCI1200 data acquisition card, with a sampling rate set at 50 kHz. After each cutting operation, the wear of the mill’s flank face was measured using a Leica MZ12 microscope, and this measurement served as the target value for each sample. The IEEE PHM Challenge 2010 dataset consists of six subsets (C1 to C6), and each subset contains data from 7 different sensor signals. Among these, subsets C1, C4, and C6 additionally include corresponding tool wear measurements, while subsets C2, C3, and C5 do not contain such data. Based on these considerations, this study selected subsets C1 and C6, which contain tool wear data, as the training set and used the C4 subset as the test set for subsequent model validation.
3.2. ParaCRN-AMResNet Training and Testing Procedure
To build and assess the model, a labeled dataset was used for training and testing. Because the task is tool wear prediction, which is a regression problem, the loss function used during supervised training and hyperparameter optimization was the mean squared error (MSE), which is defined as:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2$$
where $n$ is the number of samples; $y_i$ is the actual value of the $i$-th sample; and $\hat{y}_i$ is the predicted value for the $i$-th sample.
Adaptive moment estimation (Adam) was chosen as the optimizer because its adaptive learning rates accelerate gradient descent and make training efficient and robust [34]. The activation function selected was Swish [35], which is expressed as follows:
$$f(x) = x \cdot \sigma(\beta x)$$
where $x$ is the input, $\sigma$ is the sigmoid function, and $\beta$ is an adjustable parameter, which was set to 1 in this paper to simplify calculations. Compared with other activation functions, the Swish function is smoother and can effectively assist the optimizer in updating weights. The initial learning rate was set to 0.0005, the batch size was chosen as 16, the number of training epochs was fixed at 100, and dropout was set at 0.4. The model construction, training, and testing were all implemented using Python 3.10.12 and Keras 2.11.0, with the Keras backend being TensorFlow-GPU 2.11.0, on an Intel(R) Xeon(R) Gold 5318Y CPU @ 2.10 GHz processor and NVIDIA A100 PCIe 40 GB graphics card. The server's operating system was Ubuntu 20.04.
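For reference, the training settings stated above can be collected in one place, together with a small pure-Python version of the Swish activation. The dictionary keys and the function name are illustrative choices rather than identifiers from the authors' code, and the numerically stable evaluation of the sigmoid is an implementation detail added here.

```python
import math

# Settings as stated in the text; the key names are hypothetical.
TRAINING_CONFIG = {
    "optimizer": "adam",
    "loss": "mse",
    "learning_rate": 0.0005,
    "batch_size": 16,
    "epochs": 100,
    "dropout": 0.4,
}


def swish(x, beta=1.0):
    """Swish activation x * sigmoid(beta * x), evaluated without overflow."""
    z = beta * x
    if z >= 0:
        sig = 1.0 / (1.0 + math.exp(-z))
    else:
        ez = math.exp(z)  # safe because z is negative
        sig = ez / (1.0 + ez)
    return x * sig


for value in (-2.0, 0.0, 2.0):
    print(value, swish(value))
```

Unlike ReLU, Swish passes small negative values for negative inputs and has no kink at zero, which is the smoothness referred to above.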
3.3. Data Preprocessing
In the C1 and C6 datasets, 80% of the data were used for the training set, while the remaining 20% served as the validation set. Due to the sampling frequency set at 50 kHz, a large amount of data were generated in each cutting process, significantly increasing the time of model training. To alleviate this issue, a downsampling method was employed to extract 5000 equidistant data points from each signal for sequential concatenation. Additionally, signal processing utilized the Daubechies wavelet (db4) and a 2-level decomposition wavelet transform, enabling the model to capture both the general trends and the detailed information within the signal. Furthermore, the flank face wear values of the milling tool, measured after actual cutting operations, were used as sample labels to construct the tool wear dataset.
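The downsampling and splitting steps might look as follows in pure Python. This is a sketch under assumptions: the text does not say how the equidistant indices were computed or whether the 80/20 split was shuffled, so integer-stride indexing and an ordered split are used here, and the function names are hypothetical. The wavelet decomposition is omitted because it requires a third-party library.

```python
def downsample_equidistant(signal, n_points=5000):
    """Keep n_points samples at (nearly) equal index spacing."""
    n = len(signal)
    if n < n_points:
        raise ValueError("signal is shorter than the requested number of points")
    return [signal[i * n // n_points] for i in range(n_points)]


def split_train_val(samples, train_fraction=0.8):
    """Split an ordered list of samples into training and validation parts."""
    cut = round(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]


# A synthetic stand-in for one sensor channel recorded during a single cut.
raw = list(range(200_000))
reduced = downsample_equidistant(raw)
train, val = split_train_val(list(range(10)))
print(len(reduced), reduced[:3], len(train), len(val))
```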
3.4. Evaluation Criteria
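Four regression metrics were selected for the quantitative evaluation of the model's predictive performance: mean absolute error (MAE), MSE, mean absolute percentage error (MAPE), and the coefficient of determination ($R^2$). The standard formulas for MSE, MAPE, and $R^2$ are as follows:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2$$
$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}$$
where $n$ is the number of samples, $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $\bar{y}$ is the mean of the actual values.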
The following experiments were repeated three times, and the final metrics are the average values of these three trials.
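The four metrics and the averaging over repeated trials can be sketched as below. The functions use the standard textbook definitions (with MAPE expressed in percent), and the wear values in the example are invented purely to exercise the code; they are not results from the paper.

```python
def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; actual values must be nonzero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def average_over_trials(trial_values):
    """Average one metric over repeated runs (three runs in the text)."""
    return sum(trial_values) / len(trial_values)


# Invented wear values, only to exercise the functions.
actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]
print(mae(actual, predicted), mse(actual, predicted),
      mape(actual, predicted), r2(actual, predicted))
```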
3.5. Hyperparameter Optimization
For the proposed ParaCRN-AMResNet network, its performance is primarily influenced by five key hyperparameters: (1) the dilation rate Dr in dilated convolution; (2) cardinality in ResNeSt, represented by Nc, which specifies the number of feature groups, with the radix set to 1 considering the volume of training data and computational efficiency; (3) the number of ResNeSt blocks Ns; (4) the number of ResNet blocks Nr; and (5) the number of units in the BiGRU hidden layer, Nh. The dilation rate directly affects the ability of the convolution module to perceive and capture temporal information, while the middle three parameters are directly linked to the model’s computational efficiency and representational and noise resistance capability; the number of units in the BiGRU layer influences the model’s information storage capacity. To thoroughly assess the impact of these five hyperparameters, experiments were conducted.
3.5.1. Selection of Dilation Rate
In a dilated CNN, the dilation rate is used to enlarge the receptive field of the convolution operation, so it directly determines how much information each convolution layer can capture and thus affects the model’s perception of the entire dataset. This experiment focused on balancing the richness and scale of the features captured by the convolutional layers. The dilation rates for the four convolution layers were limited to three options: [1, 1, 1, 1], [1, 2, 4, 8], and [1, 3, 6, 9]. Additionally, the cardinality Nc in ResNeSt was tentatively set to 3 and the number of stacking layers Ns to 10; the number of ResNet stacking layers Nr was tentatively set to 4, and the number of units in the BiGRU hidden layer Nh was set to 128. Other parameter settings can be found in Table 2.
The experimental results are shown in Table 3, which reports the model’s response to the different dilation rate configurations. The [1, 2, 4, 8] setting notably improves predictive performance and yields the most favorable MAE, MSE, MAPE, and R² values.
These metrics indicate the smallest discrepancy between the predicted values and the actual values, suggesting that the model performs best under this dilation rate configuration. This also highlights the importance of appropriately selecting dilation rates for optimizing the performance of CNNs. Specifically, the configuration of dilation rates [1, 2, 4, 8] most effectively balances the scope of feature capture and the preservation of detail, ensuring that the model can comprehensively and accurately process the critical information within the data. Consequently, [1, 2, 4, 8] will be chosen as the dilation rates for the convolutional layers in subsequent experiments.
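To make the effect of the three dilation settings concrete, the sketch below computes the receptive field of four stacked dilated convolution layers. It assumes stride 1, no pooling, and a kernel size of 3; the kernel size is a hypothetical value chosen for illustration and is not taken from the text.

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field (in time steps) of stacked stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        # Each layer widens the field by (kernel_size - 1) * dilation steps.
        rf += (kernel_size - 1) * d
    return rf


for rates in ([1, 1, 1, 1], [1, 2, 4, 8], [1, 3, 6, 9]):
    print(rates, receptive_field(rates))
# [1, 1, 1, 1] 9
# [1, 2, 4, 8] 31
# [1, 3, 6, 9] 39
```

Under these assumptions, larger dilation rates cover a longer span of the signal with the same number of layers, at the cost of sampling that span more sparsely.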
3.5.2. Selection of Cardinality
In the ResNeSt architecture, cardinality specifies the number of groups into which the feature channels are split. Increasing the cardinality allows each group to specialize in learning specific features or attributes of the input data, enabling the model to capture the diversity and complexity of the data more finely. However, a higher cardinality also leads to a linear increase in computational cost and may even degrade performance because of the added network complexity. Therefore, it is necessary to find a balance between computational cost and model performance. Given that the data have seven feature channels, the candidate values for the cardinality Nc were set to [2, 3, 7]. Additionally, the number of stacking layers Ns in ResNeSt was set to 10, the number of stacking layers Nr in ResNet to 4, and the number of units in the BiGRU hidden layer Nh to 128. Other parameter settings can be found in Table 2. The final experimental results are shown in Table 4.
As shown in Table 4, when Nc is set to 3, the model achieves the lowest values in MAE, MSE, and MAPE and the highest in R². When Nc = 7, the model’s performance is slightly inferior to that with Nc = 2. It is noteworthy that the computation time per epoch for Nc = 7 is 108 s, which represents a significant increase in computational cost compared with 55 s for Nc = 3 and 50 s for Nc = 2.
The results indicate that while increasing cardinality can enhance the model’s ability to capture the diversity and complexity of input data, a higher cardinality value also linearly increases computational costs and may even lead to performance degradation due to increased network complexity. Therefore, a balance must be struck between computational cost and model performance. Considering both model performance and computational efficiency, setting the cardinality Nc to 3 is identified as the most appropriate choice.
3.5.3. Selection of ResNeSt Stacking Number
The number of stacking layers Ns in ResNeSt not only affects the model’s ability to supplement and extract temporal features but also directly relates to computational cost and the risk of overfitting. To determine the optimal value for Ns, a series of experiments was conducted. Based on preliminary results, the candidate values for this parameter were limited to [8, 10, 12, 14]. The number of stacking layers Nr in ResNet was tentatively set to 4, and the number of units in the BiGRU hidden layer Nh was set to 128. Other parameter settings can be found in Table 2. The final experimental results are shown in Table 5.
According to the data in Table 5, the model exhibits its best predictive performance when the number of stacked layers Ns in the ResNeSt architecture is set to 10. The specific performance metrics are as follows: MAE: 2.6015, MSE: 15.1921, R²: 0.9897, MAPE: 2.7997%.
The experimental results underscore the critical importance of optimizing Ns in the design process of this model. Specifically, the model needs to strike a balance between enhancing the capability to extract temporal features and controlling computational resource consumption to avoid overfitting. Among all tested configurations, setting Ns to 10 not only significantly optimized the model’s prediction accuracy but also demonstrated an effective compromise between increasing the depth of the model structure and maintaining computational efficiency. Therefore, 10 is determined as the best choice for Ns to be used in subsequent experiments.
3.5.4. Selection of ResNet Stacking Number
The number of stacking layers Nr in ResNet, like Ns in ResNeSt, affects the model, primarily with respect to the learning of feature hierarchies and the efficiency of model convergence. Based on the data from preliminary experiments, the candidate values for Nr were set to [2, 4, 6, 8]. As before, the number of units in the BiGRU hidden layer Nh was set to 128. The settings for the other parameters can be found in Table 2.
As shown in Table 6, increasing Nr beyond 4 does not improve model performance; instead, a decrease in performance is observed. The model exhibits its best performance when Nr is set to 4.
The trend in Table 6 can likely be attributed to the increased complexity of the model, which becomes a burden when training samples are insufficient, leading to overfitting and reduced generalization ability. Therefore, finding a balance between model complexity and performance is particularly crucial. When Nr = 4, all performance indicators are at their best, making this setting the optimal choice for subsequent experiments.
3.5.5. Selection of BiGRU Hidden Unit Number
A model with a larger number of hidden units Nh in BiGRU generally performs better in three respects: temporal and sequential feature extraction, information storage, and model capacity. However, a larger Nh also slows training, makes the model less practical to deploy, and makes it more difficult to interpret. Therefore, an appropriate value for Nh must be chosen based on experimental results. Based on experience, the candidate values for Nh were set to [64, 128, 256, 512]. Other parameters of the model are in accordance with Table 2. The specific experimental results are presented in Table 7.
From Table 7, it can be observed that when Nh = 128, the model demonstrates strong temporal and information storage capabilities and reaches its best performance. When Nh exceeds this value, MAE, MSE, and MAPE all increase and R² decreases.
Setting Nh = 128 balances the model’s performance with the practicality of its training and deployment, avoiding the overfitting that could arise from having too many hidden units. Additionally, this setting ensures that the model has sufficient capacity to effectively process time-series data without sacrificing operational efficiency. Further considering the runtime of the model for each epoch, Nh = 128 is ultimately selected as the optimal number of hidden units in BiGRU for use in subsequent experiments.
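Taken together, Sections 3.5.1 to 3.5.5 tune one hyperparameter at a time while the others are held at tentative or previously selected values. The sketch below illustrates that procedure under stated assumptions: `one_at_a_time_search` and `toy_error` are hypothetical names, the starting value for Dr is assumed, and `toy_error` is a placeholder that only counts disagreements with the values the text reports as selected. It carries no experimental information; a real study would train the network and return its validation error instead.

```python
def one_at_a_time_search(defaults, grids, evaluate):
    """Tune each hyperparameter in turn, keeping the best value found so far."""
    config = dict(defaults)
    for name, candidates in grids.items():
        # Score every candidate for this hyperparameter with the others fixed.
        scores = {value: evaluate({**config, name: value}) for value in candidates}
        config[name] = min(scores, key=scores.get)  # lower error is better
    return config


# Candidate values as listed in the text (dilation rates stored as tuples).
grids = {
    "Dr": [(1, 1, 1, 1), (1, 2, 4, 8), (1, 3, 6, 9)],
    "Nc": [2, 3, 7],
    "Ns": [8, 10, 12, 14],
    "Nr": [2, 4, 6, 8],
    "Nh": [64, 128, 256, 512],
}
# Tentative starting values; the initial Dr is an assumption.
defaults = {"Dr": (1, 1, 1, 1), "Nc": 3, "Ns": 10, "Nr": 4, "Nh": 128}

# Values the text reports as selected, used only by the placeholder objective.
REFERENCE = {"Dr": (1, 2, 4, 8), "Nc": 3, "Ns": 10, "Nr": 4, "Nh": 128}


def toy_error(config):
    # Placeholder objective so the sketch runs; NOT a measured validation error.
    return sum(config[name] != value for name, value in REFERENCE.items())


print(one_at_a_time_search(defaults, grids, toy_error))
```

A one-at-a-time search needs far fewer training runs than a full grid (here 18 instead of 576 configurations), but it can miss interactions between hyperparameters.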
3.6. Ablation Study
To verify the necessity and impact of each component in the proposed ParaCRN-AMResNet model, a series of ablation experiments were conducted.
Initially, to validate the necessity of integrating the CNN and RNN modules, each of the two modules was removed from ParaCRN-AMResNet in turn, and the resulting models were tested separately. Then, ResNeSt was removed from the CNN module to evaluate its contribution to supplementing feature information. Subsequently, ResNet was removed from the model to assess its contribution to computational optimization. To further validate the importance of the residual structure in the model, both ResNeSt and ResNet were removed simultaneously, creating the ParaCRN-AM model. Additionally, a model without the attention mechanism was constructed by removing all attention mechanisms from the CNN module to assess their contribution during the convolution process. To verify the effectiveness of the parallel structure, the CNN and RNN modules were instead connected sequentially, forming two sequential stacking models: CNN-RNN and RNN-CNN. Furthermore, to evaluate the differences in temporal information capture between BiGRU and traditional GRU, the BiGRU in the original model was replaced with GRU, giving the GRU-AMResNet model. Finally, the number of hidden layer units in the RNN module was fixed to create the FixRNN model, which was used to assess the effectiveness of the structure designed for capturing temporal features of different scales. All of these experimental models used parameter settings consistent with those of ParaCRN-AMResNet.
As shown in Table 8, all deep learning network structures demonstrate relatively accurate predictive performance in the task of tool wear prediction. Notably, the proposed ParaCRN-AMResNet model achieves a standout performance with an MAE of 2.6015, MSE of 15.1921, R² value of 0.9897, and MAPE of 2.7997%.
A comparison of the metrics across these models shows that incorporating residual blocks designed to capture features that the dilated convolutional neural network might omit significantly improves performance. Furthermore, introducing the overall residual block structure into the network notably enhances model performance. The attention convolution block is also pivotal to the model’s success, as evidenced by the inferior metrics of the model without attention. The performance of the FixRNN model underscores the importance of capturing temporal features at different scales, which helps the model understand multi-scale information from various periods. The overall structure of the model also significantly impacts its performance: the CNN-RNN and RNN-CNN models show a marked decrease in performance, even more so than the model without attention, further validating the superiority of the proposed parallel structure.
Additionally, with safety in mind, Figure 11 compares the predicted and actual wear values on the cutting blades with the greatest wear for the ParaCRN-AMResNet, without RNN, without CNN, without attention, and ParaCRN-AM models. In line with the metrics presented in Table 8, the prediction accuracy of these models clearly varies, with the proposed ParaCRN-AMResNet model coming closest to the actual values. These results underscore the model’s superior capability in capturing temporal features at varying scales and its efficiently designed parallel structure, positioning the ParaCRN-AMResNet model ahead of other reference models in terms of overall performance.
3.7. Comparative Experiments
To more comprehensively assess the performance of the model, PR-AUC [36], CGRU-IConvGRU-A [37], ConvLSTM-Att [24], and MDMCNN-BiLSTM [38] were used as benchmarks for comparison. The experimental settings follow those described in the original literature, using MAE and R² as performance metrics. All results are presented in Table 9.
An analysis of the data from Table 9 clearly indicates that the proposed model significantly outperforms the comparison group on key performance metrics. Traditional sequential structure-based models such as PR-AUC, ConvLSTM-Att, and MDMCNN-BiLSTM are prone to accumulating errors during data transmission, which affects prediction accuracy; in contrast, the ParaCRN-AMResNet model employs a parallel architecture design. This design enables independent parallel processing of various features, effectively preventing the common problem of error accumulation associated with sequential processing. Such parallel processing not only significantly enhances computational efficiency but also reduces the decline in predictive accuracy caused by error propagation. Although the CGRU-IConvGRU-A model also utilizes a parallel structure, the sequential arrangement of its internal CNN and GRU components does not fully eliminate inter-module interference.
Furthermore, compared with the 1D CNN used by CGRU-IConvGRU-A, ConvLSTM-Att, and MDMCNN-BiLSTM, the DCNN used in ParaCRN-AMResNet exhibits superior performance in processing time-series data, benefiting from its wider receptive field and deeper feature abstraction capabilities. While the comparison models attempt to capture multi-scale features using convolutional kernels of various sizes, the inherent limitations of their receptive fields render their performance inferior to that of ParaCRN-AMResNet. Although the PR-AUC model employs DCNN to capture time-series features, ParaCRN-AMResNet combines DCNN with a BiGRU structure. This integration allows the model to more effectively capture features across different temporal scales. The introduction of BiGRU enhances the model’s ability to capture long-term temporal dependencies, which is challenging to achieve with DCNN alone. Additionally, the multi-dimensional BiGRU in ParaCRN-AMResNet, based on a Seq2Seq structure, contrasts sharply with the fixed-size RNN architectures in other models, enabling the proposed model to more effectively capture and utilize long-distance dependencies in time-series data.
Unlike the approach of ConvLSTM-Att and MDMCNN-BiLSTM models, which emphasize features at the end of the model using an attention mechanism, ParaCRN-AMResNet opts to integrate a ResNeSt structure and SimAM attention mechanism within the parallel convolutional component to supplement potentially missed features. The use of the SimAM attention mechanism does not add extra parameters, thereby avoiding additional computational burden and further optimizing the model’s performance and efficiency.
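The parameter-free nature of SimAM can be illustrated with a minimal sketch of its commonly published formulation, adapted here to a single 1D channel. The function names, the adaptation to 1D, and the regularization constant `lam = 1e-4` are assumptions for illustration; this is not the implementation used in the model.

```python
import math


def simam_weights(channel, lam=1e-4):
    """Per-sample attention weights for one channel; no trainable parameters."""
    if len(channel) < 2:
        raise ValueError("channel needs at least two samples")
    mu = sum(channel) / len(channel)
    sq_dev = [(x - mu) ** 2 for x in channel]
    var = sum(sq_dev) / (len(channel) - 1)
    # Inverse of the minimal energy: samples far from the mean score higher.
    scores = [d / (4 * (var + lam)) + 0.5 for d in sq_dev]
    return [1 / (1 + math.exp(-s)) for s in scores]  # sigmoid


def simam_1d(channel, lam=1e-4):
    """Rescale each sample of the channel by its attention weight."""
    return [x * w for x, w in zip(channel, simam_weights(channel, lam))]


# Hypothetical channel with one distinctive sample at the end.
weights = simam_weights([0.0, 0.0, 0.0, 0.0, 10.0])
print([round(w, 3) for w in weights])
```

The weights depend only on the channel's own mean and variance, so the only setting is the fixed constant `lam`; the distinctive sample receives a larger weight than the others.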
3.8. Noise Resistance Experiment
Considering that real manufacturing environments are often accompanied by strong noise, it is crucial to assess the stability and prediction accuracy of high-performance models in such contexts. To this end, a series of noise interference experiments was designed to verify the noise resistance of the proposed model. Specifically, to simulate extreme working conditions, Gaussian white noise was added to the original signals at signal-to-noise ratios (SNRs) of −1 dB, −3 dB, −5 dB, −7 dB, and −9 dB. Furthermore, the ParaCRN-AM, without attention, CNN-RNN, and RNN-CNN models were selected as controls to assess their robustness under different noise conditions. The evaluation criteria were still the four metrics defined in Section 3.4: MAE, MSE, MAPE, and R². The detailed experimental results are presented in Figure 12.
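The noise injection described above can be sketched as follows. The sketch assumes that signal power is taken as the mean square of the clean signal and that the noise is zero-mean Gaussian; the function names and the test signal are hypothetical.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, rng=None):
    """Add zero-mean Gaussian white noise so that the result has the given SNR (dB)."""
    rng = rng or random.Random()
    p_signal = sum(x * x for x in signal) / len(signal)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    p_noise = p_signal / (10 ** (snr_db / 10))
    sigma = math.sqrt(p_noise)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR of `noisy` relative to `clean`, in dB."""
    p_signal = sum(c * c for c in clean) / len(clean)
    p_noise = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(p_signal / p_noise)


# Hypothetical clean signal: a sine wave standing in for a sensor channel.
clean = [math.sin(2 * math.pi * i / 100) for i in range(20_000)]
for snr in (-1, -3, -5, -7, -9):
    noisy = add_gaussian_noise(clean, snr, random.Random(0))
    print(snr, round(measured_snr_db(clean, noisy), 2))
```

At −9 dB the noise power is roughly eight times the signal power, which is why this setting is treated as an extreme condition.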
The data in Figure 12 reveal a clear trend: as the noise intensity increases, the predictive performance of all models declines. However, among all models examined, ParaCRN-AMResNet performs best. Even under extreme conditions with an SNR of −9 dB, the model still provides satisfactory prediction results, with corresponding MAE, MSE, R², and MAPE values of 13.8012, 282.4792, 0.8080, and 14.2770%, respectively.
The ParaCRN-AMResNet model is able to mine and correlate more sensitive features from signals mixed with noise, demonstrating exceptional noise resistance capability. Furthermore, the superiority of the parallel-structured model over the CNN-RNN sequential model further confirms that the proposed parallel structure can successfully avoid mutual interference between modules. In contrast, the performance of the RNN-CNN model in an SNR = −1 dB environment is even worse than a simple mean prediction; hence, its data were not included in Figure 12. The performance comparison between the without attention model and ParaCRN-AMResNet further verifies that the introduced attention module can help the model capture key features in noisy environments. Additionally, the performance of the ParaCRN-AM model compared with ParaCRN-AMResNet demonstrates the ability of the residual block structure to help the model capture additional useful information in harsh environments.
Finally, Figure 13 compares the predictive results of the ParaCRN-AMResNet model with the actual outcomes, both in the presence and absence of −9 dB noise. Despite the intense background noise, the predictive results still accurately capture the trend of tool wear. This indicates the potential industrial application value of the ParaCRN-AMResNet model.
4. Conclusions
In this paper, a novel hybrid deep learning model, ParaCRN-AMResNet, is proposed for the prediction of tool wear. The raw signals are decomposed using wavelet analysis and used as input data. The model then enhances the discernibility of temporally sensitive features through the SimAM attention layer and employs dilated convolutional neural networks and the ResNeSt structure to capture these features across various scales. BiGRU is incorporated into the model, working in parallel with the dilated CNN to capture time-series information. A GAP layer is applied to reduce redundant spatial features and enhance the model’s interpretability. These features are fused and used to predict tool wear. The results of the ablation experiments, comparative trials, and noise resistance tests indicate that:
- (1) The parallel structure ensures that each feature extraction pathway operates correctly without interference from the others. This approach avoids the impact of an earlier module on subsequent modules.
- (2) The use of dilated CNN effectively captures the intrinsic temporal correlations within time-series data, and ResNeSt additionally supplements crucial information for the convolutional component. Meanwhile, BiGRU with different sizes can effectively capture meaningful representations across various temporal dimensions.
- (3) Experimental validation demonstrates that ParaCRN-AMResNet outperforms other deep learning models in tool wear prediction, achieving MAE, MSE, R², and MAPE values of 2.6015, 15.1921, 0.9897, and 2.7997%, respectively.
This study has two main limitations: the model was validated on a single dataset, which may limit the generalizability of its predictive accuracy across different conditions or datasets, and its training duration hinders immediate deployment in practical settings. Future research will therefore focus on extending the model’s applicability through transfer learning. By transferring knowledge acquired on a specific task to related but distinct tasks, this approach aims not only to enhance the model’s adaptability but also to significantly reduce the required training time and resources. This research direction seeks to increase the practicality of the ParaCRN-AMResNet model, offering more flexible and efficient solutions for the advancement of intelligence in manufacturing.
The primary objective of this paper is not to propose an immediately applicable solution but rather to explore a promising approach aimed at enhancing the generalization performance of tool wear prediction models. Accurate real-time prediction of tool wear can, in turn, facilitate proactive maintenance of CNC machining tools.