2.1. Feature Selection
Feeding-material classification and fault diagnosis of the crusher are studied by collecting, processing, and analyzing the signals of the crusher in its operating state. The first key step in this process is feature selection. The selected features need to satisfy the following requirements:
- (1) Significant difference. The typical features of the different feeding materials should differ significantly, which improves classification effectiveness and greatly reduces the amount of calculation.
- (2) Easy availability. Both the data acquisition and the analysis algorithms should be simple and readily implemented, allowing a rapid response to fault signals.
- (3) Broad applicability. The algorithm proposed in this paper aims to be applicable not only to different types but also to different working conditions of two-tooth roll crushers. Broad applicability is a central concern in the field of crusher fault diagnosis.
Over the last few years, some researchers have focused on feeding material classification. Pan [22] studied the audio signals of iron, wood, and coal individually in the crusher cavity and sorted the signals using a Back Propagation (BP) neural network. On this basis, Chen [23] transformed the one-dimensional raw audio signal into a two-dimensional matrix sequence and then compared the classification accuracy of the signal grayscale, time-frequency diagram, and wavelet transform representations on the basis of LeNet-5. Yan [24] used the time-domain audio signal to build a calculation model for feeding material classification. Previous research has shown that the composition of feeding materials can be classified to some extent, but there is room for improvement in accuracy and processing efficiency.
Table 1 compares a selection of salient features of the above studies.
On the whole, the audio signals recorded during crushing show some distinguishing features in the time and frequency domains; however, these features are not very discriminative on their own, and there is considerable noise interference. Therefore, in this paper, a multi-sensor system comprising two acceleration sensors and one sound pressure sensor is used to reduce the monitoring error, and the wavelet-transform image, which contains the time-frequency characteristics of the collected signal, is selected as the typical feature for classifying the feeding materials with the help of a CNN.
2.2. Spectral Subtraction
Whether in laboratory or factory environments, monitoring and fault diagnosis of crushers during equipment operation are always accompanied by environmental noise. Therefore, it is necessary to preprocess the collected sensor signals with spectral subtraction to reduce the signal interference.
As a stand-alone noise suppression algorithm, spectral subtraction is able to reduce the spectral effects of acoustically added noise in speech [25]. By subtracting an estimate of the noise spectrum from the noisy speech spectrum, an estimate of the clean speech signal spectrum can be obtained [26]. Generally, the noise spectrum can be estimated during the no-load test before material feeding. According to the study reported in [27], the formula for spectral subtraction is as shown in Equation (1):

$$ |\hat{S}(\omega)|^{2} = \begin{cases} |X(\omega)|^{2} - \alpha\,|\hat{N}(\omega)|^{2}, & \text{if } |X(\omega)|^{2} - \alpha\,|\hat{N}(\omega)|^{2} > \beta\,|\hat{N}(\omega)|^{2} \\ \beta\,|\hat{N}(\omega)|^{2}, & \text{otherwise} \end{cases} \quad (1) $$

where $x$ is the input signal, $|\hat{S}(\omega)|$ is the modified signal spectrum, $|X(\omega)|$ is the spectrum of the input noise-corrupted speech, $|\hat{N}(\omega)|$ is the smoothed estimate of the noise spectrum, $\alpha$ is the subtraction factor, and $\beta$ is the spectral floor parameter. In this way, a great reduction in background noise can be achieved with very little effect on the intelligibility of the speech.
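As an illustration only, the subtraction step of Equation (1) can be sketched in NumPy; the function name and the default values of α and β below are assumptions of this sketch, not the settings used in this paper:

```python
import numpy as np

def spectral_subtraction(noisy, noise_estimate, alpha=2.0, beta=0.01):
    """Power spectral subtraction with an over-subtraction factor (alpha)
    and a spectral floor (beta). `noisy` and `noise_estimate` are 1-D
    time-domain signals; in this setting the noise estimate would come
    from a no-load recording before material feeding."""
    X = np.fft.rfft(noisy)
    N = np.fft.rfft(noise_estimate, n=len(noisy))
    power = np.abs(X) ** 2 - alpha * np.abs(N) ** 2
    floor = beta * np.abs(N) ** 2
    power = np.maximum(power, floor)     # clamp to the spectral floor
    phase = np.angle(X)                  # reuse the phase of the noisy signal
    return np.fft.irfft(np.sqrt(power) * np.exp(1j * phase), n=len(noisy))
```

The noisy-signal phase is reused because spectral subtraction modifies only the magnitude spectrum.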
In recent years, spectral subtraction has been used widely in the fields of sound source separation [28], fault detection [29], speaker identification [30], speech enhancement [29], encrypted speech [31], and random noise reduction [26].
2.3. Continuous Wavelet Transforms
As a standard mathematical tool, the wavelet transform (WT) is used for data analysis where features vary over different scales, and it was created primarily to address the limitations of the Fourier transform [32]. A base wavelet is needed in order to realize the wavelet transform; a wavelet is a small wave that has an oscillating, wave-like character and has its energy concentrated in time. The WT is based on decomposing signals into shifted and scaled versions of a wavelet and can be divided into two broad classes: the continuous wavelet transform (CWT) and the discrete wavelet transform (DWT) [33].
The CWT is a time–frequency transform, which is ideal for the analysis of non-stationary signals. Additionally, it can be used to analyze transient behavior, rapidly changing frequencies, and slowly varying behavior, which is very suitable for the research object of this article.
The CWT of a signal $x(t)$ is defined as shown in Equation (2) [34,35]:

$$ W(s,\tau) = \frac{1}{\sqrt{s}} \int_{-\infty}^{+\infty} x(t)\, \psi^{*}\!\left(\frac{t-\tau}{s}\right) \mathrm{d}t \quad (2) $$

where $s$ represents the scale parameter, $\tau$ represents the time or translation parameter, $\psi\!\left(\frac{t-\tau}{s}\right)$ represents the wavelet function with scale $s$ and position offset $\tau$, and $\psi^{*}$ is the complex conjugate of $\psi$.
In this paper, the CWT is used to process vibration signals after spectral subtraction and to obtain the scalogram images, which correspond to the absolute value of the CWT coefficients of a signal.
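For illustration, a minimal scalogram along the lines of Equation (2) can be sketched directly in NumPy; the Morlet mother wavelet, the normalization, and the parameter names here are assumptions of this sketch, not the exact settings used in the paper:

```python
import numpy as np

def cwt_scalogram(x, scales, fs=1.0, w0=6.0):
    """Direct discretization of Equation (2): for each scale s, the signal
    is correlated with the shifted, scaled, conjugated wavelet, and the
    absolute values of the coefficients form the scalogram image.
    A Morlet wavelet with centre frequency w0 is assumed here."""
    n = len(x)
    t = (np.arange(n) - n // 2) / fs          # time axis centred at zero
    coeffs = np.empty((len(scales), n), dtype=complex)
    for i, s in enumerate(scales):
        u = t / s
        # Morlet wavelet at scale s, with the 1/sqrt(s) normalization of Eq. (2)
        psi = np.exp(1j * w0 * u) * np.exp(-0.5 * u ** 2) / np.sqrt(s)
        # correlation implemented as convolution with the reversed conjugate
        coeffs[i] = np.convolve(x, np.conj(psi)[::-1], mode="same") / fs
    return np.abs(coeffs)
```

Each row of the returned array is one scale; plotting it as an image gives the scalogram fed to the CNN.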
2.4. Deep Learning and Convolutional Neural Networks
Nowadays, when it comes to problems of image recognition and classification, CNNs are regarded as the first choice for solving them [36]. Developed from machine vision, they are able to extract image features and build models automatically, overcoming the subjective influence of researchers. Moreover, with the characteristics of weight sharing and local linking, they improve the accuracy and efficiency of image classification, and they have already been applied to image classification tasks in many fields, such as face recognition [37] and iris recognition [38] in the biological field, license plate recognition [39] in the autonomous driving field, and concentrate ash content prediction in coal flotation [40] and wet coal image classification [41] in the mining field.
In particular, CNNs are a type of back-propagation neural network with a deep structure that performs classification tasks using convolutional computation with translation invariance; in a CNN, the convolution operation takes the place of general matrix multiplication.
CNNs primarily consist of input layers, convolutional layers, normalization layers, activation layers, pooling layers, fully connected layers, and a classification layer. In the network, these layers are connected in sequence to capture image information, automatically update the weights, and build the classification model. The specific composition is as follows:
(1) Input layers: This layer mainly pre-processes the original image data. Mean-subtraction, normalization, PCA whitening, and local contrast normalization are some of the commonly used pre-processing tools. Because PCA whitening may amplify data noise, most CNN models employ just a basic mean-subtraction (and possibly normalization) step for pre-processing. The scaling and shifting accomplished by mean-subtraction and normalization are beneficial to gradient-based learning.
Specifically, mean-subtraction is used to make the mean value of the pixels at each position across all training images equal to zero. Given $m$ training images, where $x^{(i)}$ represents a single sample, the mean-subtraction step is as shown in Equation (3):

$$ x^{(i)} \leftarrow x^{(i)} - \frac{1}{m} \sum_{j=1}^{m} x^{(j)} \quad (3) $$

where $x$ represents the input signal, $m$ is the number of input samples, and $i$ is an index.
The normalization function is employed so that the data will be on the same scale. To normalize the standard deviation to a unit value, the input data are divided by the standard deviation of each input dimension determined on the basis of the training set. This can be represented as shown in Equation (4):

$$ x^{(i)} \leftarrow \frac{x^{(i)}}{\sigma}, \qquad \sigma = \sqrt{\frac{1}{m} \sum_{j=1}^{m} \left(x^{(j)}\right)^{2}} \quad (4) $$

where $x$ represents the input signal, $m$ is the number of input samples, and $i$ is an index.
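Taken together, the two pre-processing steps referred to in Equations (3) and (4) amount to the following NumPy sketch (the function name is illustrative):

```python
import numpy as np

def standardize(images):
    """Per-position mean-subtraction (Eq. 3) followed by division by the
    per-position standard deviation (Eq. 4), both computed over the m
    training images stacked along axis 0."""
    mean = images.mean(axis=0)       # mean pixel value at each position
    std = images.std(axis=0)
    std[std == 0] = 1.0              # guard against division by zero
    return (images - mean) / std
```

After this step every pixel position has zero mean and unit standard deviation over the training set.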
(2) Convolutional layer: A convolutional layer is made up of a series of convolutional kernels, and each kernel can be regarded as a feature extractor; different kernels extract different features. A convolutional kernel is generally initialized as a random decimal matrix, and reasonable weights are acquired while training the network. The local receptive field is a region of the input with the same size as the convolutional kernel, and the convolutional result between the two is one value on a feature map. Each convolutional layer usually produces several feature maps, and the number of feature maps equals the number of convolutional kernels.
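A naive sketch of this sliding-window computation, for a single input feature map and a single kernel with no padding (function name and stride default are illustrative):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """'Valid' 2-D convolution: each output value is the element-wise
    product-sum of the kernel with the local receptive field it covers."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.empty((oh, ow))
    for y in range(oh):
        for x in range(ow):
            patch = image[y * stride:y * stride + kh,
                          x * stride:x * stride + kw]
            out[y, x] = np.sum(patch * kernel)   # one value on the feature map
    return out
```

A real convolutional layer repeats this for every kernel (producing one feature map each) and sums over input channels; deep-learning frameworks also vectorize the loops.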
(3) Normalization layer: Batch normalization [42] is used to normalize the mean and variance of the output activations of a CNN layer so that they follow a unit Gaussian distribution [43]. This normalization adjusts the variance and the mean position of the distribution before the output is passed to the activation layer, which effectively improves accuracy, prevents the gradient from vanishing or exploding, and accelerates network convergence. The batch normalization operation can be implemented as a layer in a CNN, as shown in Equation (5).
$$ \hat{x}_{i} = \frac{x_{i} - \mu}{\sqrt{\sigma^{2} + \epsilon}}, \qquad y_{i} = \gamma \hat{x}_{i} + \beta \quad (5) $$

where $x_{i}$ is the input of the layer, $i$ is an index, $\mu$ is the mean, $\sigma$ is the standard deviation, $\epsilon$ is a small constant for numerical stability, $\hat{x}_{i}$ is the standard score, and $\gamma$ and $\beta$ are learnable variables.
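Equation (5) can be sketched for a mini-batch as follows; fixing gamma and beta instead of learning them is an illustration-only simplification:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization as in Equation (5): standardize each feature
    over the mini-batch (axis 0), then scale by gamma and shift by beta.
    In training, gamma and beta would be learned parameters."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # standard score
    return gamma * x_hat + beta
```

With the defaults, each feature of the output has approximately zero mean and unit variance over the batch.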
(4) Activation layer: The activation function introduces a nonlinear factor to the neuron, meaning that the neural network can approximate any nonlinear function arbitrarily well, and thus can be applied to many nonlinear models. The Rectified Linear Unit (ReLU) is the most commonly used activation function, owing to its fast convergence, high efficiency, unilateral inhibition, relatively wide excitation boundary, and better sparsity, as shown in Equation (6) [44]:

$$ f(x) = \max(0, x) \quad (6) $$
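In code, Equation (6) is a one-line element-wise operation:

```python
import numpy as np

def relu(x):
    """ReLU from Equation (6): f(x) = max(0, x), applied element-wise.
    Negative activations are zeroed (unilateral inhibition), which is
    what produces the sparsity noted above."""
    return np.maximum(0.0, x)
```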
(5) Pooling layer: The pooling layer between successive convolutional layers is able to compress the data and reduce the number of parameters, thus reducing over-fitting; it can be regarded as a down-sampling operation. The commonly used pooling operations are max-pooling and average-pooling: max-pooling selects the largest value from a locally related set of elements, while average-pooling returns the average of a locally related set of elements.
Briefly, max-pooling retains texture features, while average-pooling retains the overall data features. Therefore, max-pooling was selected in this paper in order to preserve more texture information in the image and to provide strong model robustness.
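Both pooling variants can be sketched with one non-overlapping windowed reduction (the function name and the square, stride-equals-size window are assumptions of this sketch):

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping size-by-size pooling: max-pooling keeps the largest
    value in each window, average-pooling keeps the mean."""
    h, w = x.shape
    x = x[: h - h % size, : w - w % size]   # trim to a multiple of the window
    hb, wb = x.shape
    blocks = x.reshape(hb // size, size, wb // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))
```

Each 2x2 window of the input collapses to one output value, quartering the number of activations.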
(6) Fully connected layer: The fully connected layer connects the features of all previous layers to form the output values and transmits them to the classifier. It is effectively a convolutional operation in which the kernel size equals the size of the preceding feature map.
(7) Classification layer: The classification layer performs the final classification decision, and its main function is to output the probability that the object belongs to each class. For binary classification issues, the Sigmoid function is usually employed, as shown in Equation (7), while for multi-classification problems, the Softmax function is commonly utilized, as shown in Equation (8).
$$ \sigma(z) = \frac{1}{1 + e^{-z}} \quad (7) $$

where $z$ is the output value, and the value range of the Sigmoid function is $(0, 1)$.
$$ \mathrm{Softmax}(z_{i}) = \frac{e^{z_{i}}}{\sum_{j=1}^{k} e^{z_{j}}} \quad (8) $$

where $z_{i}$ is the output value, $k$ is the number of outputs, and the value range of the Softmax function is $(0, 1)$.
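The two output functions of Equations (7) and (8) can be sketched as follows (the max-subtraction in the softmax is a standard numerical-stability trick, not part of the equation itself):

```python
import numpy as np

def sigmoid(z):
    """Equation (7): maps a scalar score into (0, 1) for binary tasks."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Equation (8): normalized exponentials over the k outputs, giving a
    probability for each class; the max is subtracted before
    exponentiating to avoid overflow."""
    e = np.exp(z - np.max(z))
    return e / e.sum()
```

The softmax outputs sum to one, so each entry can be read as the probability that the input belongs to the corresponding class.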
In this paper, considering that the research objects involve coal, wood, and iron, the Softmax function is preferred. However, because the data analysis in the testing process also involves binary classification problems, the two functions need to be used separately. To allow comparison of the classification results, the Softmax function is selected for the final training classification.