Article

CNN-Transformer for Microseismic Signal Classification

College of Computer Science and Engineering, Shandong University of Science and Technology, Qianwan Port Road, Qingdao 266590, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(11), 2468; https://doi.org/10.3390/electronics12112468
Submission received: 24 April 2023 / Revised: 25 May 2023 / Accepted: 29 May 2023 / Published: 30 May 2023
(This article belongs to the Section Networks)

Abstract: The microseismic signals of coal and rock fractures collected by underground sensors are mixed with large numbers of blasting vibration signals generated by coal mine blasting, and the waveforms of the two signals are highly similar. In order for a microseismic monitoring system to identify the true microseismic signals quickly and accurately, this paper proposes a lightweight network model that combines a convolutional neural network (CNN) and a transformer, named CCViT. In this model, the CNN extracts shallow features locally, and the transformer extracts deep features globally. Moreover, a modified channel attention module provides important channel information to the model and suppresses useless information. The experimental results on the dataset used in this paper show that the proposed CCViT model has significant advantages in floating point operations (FLOPs), parameter count, and accuracy compared to many advanced network models.

1. Introduction

In recent years, microseismic monitoring technology has become an effective way to monitor dynamic disasters in coal mines [1]; it can monitor the microseismic activity of coal rock ruptures in real time and record the monitoring data. However, the coal mine environment is complex. The signals collected by the sensors are mixed with interference, such as blasting vibrations and noise. In particular, the microseismic signals of coal rock rupture and the blasting vibration signals have very similar waveforms, which a microseismic monitoring system finds hard to distinguish. This problem affects the monitoring of microseismic activity in coal rock rupture and, in turn, the accuracy and timeliness of disaster warnings. It leads to inefficient mining and makes it hard to ensure the safety of workers. Therefore, it is important to identify the microseismic signals of coal rock ruptures quickly and accurately from the mass of monitoring data. In practical engineering, effective microseismic signals are mainly obtained by manual identification, which relies chiefly on the experience of engineers, making recognition difficult and inefficient.
The commonly used methods for coal rock rupture microseismic signal identification [2] include time series analysis, machine learning, and deep learning, where traditional time series analysis comprises parameter identification and time–frequency analysis. A parameter identification method selects one or more characteristic parameters from the time domain of the series to be identified and classifies different types of signal by analyzing these feature parameters. Ma et al. [3] used source and waveform parameters as feature vectors to identify microseismic and blasting signals. Zhao et al. [4] used microseismic waveform repetition, tail decay, principal frequency, and time of occurrence as feature parameters to identify signals. The time–frequency analysis method transforms the signal sequence to obtain frequency domain or energy-related characteristic parameters for identifying a signal. Lu et al. [5] used a Fourier transform to analyze the power spectrum and amplitude–frequency characteristics of different types of microseismic signal, which provided a basis for the preliminary identification of different types of microseismic signals in mines. Wavelet analysis and wavelet packet analysis fuse the time–frequency domain information of microseismic signals to enhance their discrimination [6]. Empirical mode decomposition (EMD) decomposes the original signal into different frequency bands, to better handle random non-stationary signals [7]. Variational modal decomposition (VMD) is a non-recursive signal decomposition method that adaptively and efficiently separates the frequency domain components of a signal [8]. Zhang et al. [9] used VMD to study the energy distribution characteristics and the energy center of gravity of coal rock rupture microseismic waveforms and blasting vibration waveforms in each modal component; the center of gravity coefficient of the energy distribution was then used as a feature to classify the two types of signal. Such methods rely on the experience of researchers and engineers to extract signal features and identify microseismic signals manually according to characteristic parameters. Despite some successes, they still have flaws, and their efficiency needs further improvement.
The traditional machine learning approach first obtains feature parameters as input through conventional time series analysis and then uses machine learning algorithms for microseismic signal identification. Zhu et al. [10] constructed a support vector machine (SVM) model for microseismic event classification, using the fractal box-counting dimension of the frequency band as the signal feature for blasting, electromagnetic, and microseismic signals. Shang et al. [11] used EMD and singular value decomposition (SVD) to extract mine signal features and applied an SVM to discriminate between microseismic and blasting signals in the Shaba mine. Li [12] classified microseismic and blasting signals based on local mean decomposition (LMD) and pattern recognition methods. Peng et al. [13] developed a Gaussian mixed hidden Markov model (GMM-HMM) based on mel-frequency cepstral coefficients (MFCC). Zhang et al. [14] combined VMD and multiscale singular spectrum entropy to construct signal feature vectors and used SVMs to identify microseismic signals. Such methods require the pre-extraction of relevant feature parameters, do not fully exploit the original data, may miss significant information, and are inefficient when faced with recognizing large numbers of waveform signals.
Deep learning methods automatically extract features from the input data by building a network framework and classify signals based on the extracted features. Peng et al. [15] extracted features from the signal in advance to compose a feature matrix and then used a CNN for classification; this method still relies on the experience of researchers and cannot automatically extract features from the original data. Ma et al. [16] used convolutional neural networks to automatically extract time and frequency domain features from time–frequency maps obtained with the short-time Fourier transform, to identify microseismic signals. The designed networks performed well on their corresponding datasets. However, owing to the relatively simple design of the network structures, their performance was limited on complex datasets containing a large number of signals with highly similar waveforms. ResNet [17] introduced a residual structure that alleviates the vanishing and exploding gradient problems arising from deepening a network, enabling convolutional neural networks to obtain better performance by adding layers. Previous studies [18,19] chose to build deeper convolutional neural networks to improve model robustness, but the number of model parameters increased dramatically, training required large-scale datasets, and the training difficulty and training time also increased substantially.
This paper aims to design a unified lightweight classification framework with a CNN and transformer and to construct a channel attention structure for the fast and accurate identification of microseismic signals. The main contributions of this paper are as follows:
  • A unified lightweight classification framework combining a CNN and transformer is designed. The FLOPs of the model are 67% lower than those of MobileViT [20]. The model extracts features from spatial, channel, local, and global perspectives and can obtain rich feature information with only a few parameters, to classify signals quickly and accurately;
  • A channel attention structure is constructed to capture important inter-channel information. The module is plug-and-play and uses a small number of parameters to enhance the feature extraction capability of the classification framework.

2. Related Work

2.1. Lightweight Model

Considering the speed and performance requirements of the microseismic signal recognition task, building a lightweight network model is more suitable. Since Chollet [21] proposed depthwise separable convolution, which improved the computational efficiency of the convolutional layer, many researchers have applied it to build lightweight CNNs, such as MobileNetv1 [22], MobileNetV2 [23], MobileNetv3 [24], and the ShuffleNet series [25]. Although these lightweight networks are easy to train, the actual receptive field of a CNN cannot cover the whole input, and when used for microseismic signal recognition, it is difficult to capture the features accurately. The transformer [26] discarded the traditional CNN structure; it captures global contextual information through self-attention and has influenced many domains. The vision transformer (ViT) [27] first applied the standard transformer structure to images, but it depends heavily on pretrained weights. In addition, the transformer structure lacks the inductive biases unique to CNNs, which makes learning hard and often requires vast datasets to achieve good training results. This paper borrows the idea of MobileViT [20] to combine a CNN and transformer: the CNN provides spatial inductive bias information for the transformer, which improves the stability and performance of the transformer and drastically reduces the number of model parameters, while the transformer captures signal features from a global perspective, which compensates for the shortcomings of the CNN.
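To make the efficiency gain of depthwise separable convolution concrete, the following sketch (our minimal PyTorch illustration, not code from the paper; the channel sizes are arbitrary) compares the parameter counts of a standard convolution and its depthwise separable counterpart:

```python
import torch.nn as nn

# Illustrative channel sizes; not taken from the paper.
in_ch, out_ch, k = 64, 128, 3

standard = nn.Conv2d(in_ch, out_ch, k, padding=1)
separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, k, padding=1, groups=in_ch),  # depthwise: one k x k kernel per channel
    nn.Conv2d(in_ch, out_ch, 1),                          # pointwise: 1 x 1 conv mixes channels
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))  # 73856 vs. 8960: roughly 8x fewer parameters
```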

2.2. Attention Mechanism

The essence of attention mechanisms is to focus on important information and suppress useless information. Jaderberg et al. [28] proposed a spatial attention mechanism to locate targets of interest and transform them or obtain weights for them. SENet [29] attends to the channel dimension, proposing a channel attention mechanism that enhances or suppresses the information of each channel. CBAM [30] proposed a mixed attention mechanism that attends to spatial and channel information at the same time. CA [31] extends SENet with positional information along the height and width directions, obtaining feature information in two directions. However, none of the attention mechanisms above suit the microseismic signal classification task; with them, the model performance instead decreases. The channel weights they extract are too generalized and do not fall on specific features. This paper makes the model's attention more focused by strengthening the extracted channel weights.

2.3. Discussion

This paper focuses on designing lightweight models. To this end, the CCViT model was devised. The model incorporates the advantages of a CNN, a transformer, and channel attention, ensuring that it can accurately capture the feature information required by the microseismic signal classification task while effectively reducing the overall computational overhead. The model is more lightweight, using fewer parameters and FLOPs than other lightweight models, and it performs better than other models on the microseismic signal classification task.

3. Method

3.1. Overall Architecture

Considering the speed and performance requirements of the microseismic signal recognition task, this paper combines a CNN and transformer to build a lightweight network model. The overall structure of the CCViT model is shown in Figure 1, where k represents the convolutional kernel size, s the stride, and n the number of transformer stacking blocks. The model contains four stages, each of which first downsamples using a pooling layer. The module in the first two stages uses a CNN to extract shallow features locally and is named the CSAConv module. The module in the last two stages uses a transformer as a feature extractor to extract deep features globally and is named the CSAConvViT module. Finally, a classifier performs the classification. To keep the model lightweight, this paper builds a shallow and narrow network: the number of blocks stacked in each stage is set to 2, 2, 1, and 1; the focus of the model is placed on the third stage; the number of transformer stacked blocks n in the third and fourth stage modules is set to 5 and 3; and the channels in each stage are set to 16, 24, 48, and 64. The specific internal parameters of the CCViT model are shown in Table 1.
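The following sketch (our illustration, not released code) shows the four-stage skeleton with these settings. The stage bodies are stand-in Identity modules, and, since Table 1 shows each "pooling" layer changing the channel count, we assume the downsampling is implemented as a strided convolution:

```python
import torch
import torch.nn as nn

class CCViTSkeleton(nn.Module):
    """Stage layout only: blocks (2, 2, 1, 1), channels (16, 24, 48, 64)."""
    def __init__(self, channels=(16, 24, 48, 64), blocks=(2, 2, 1, 1), num_classes=2):
        super().__init__()
        layers, in_ch = [], 3
        for i, (c, b) in enumerate(zip(channels, blocks)):
            k = 4 if i == 0 else 2                       # "Pool i" of Table 1 (assumed strided conv)
            layers.append(nn.Conv2d(in_ch, c, k, stride=k))
            layers += [nn.Identity() for _ in range(b)]  # placeholders for CSAConv / CSAConvViT
            in_ch = c
        self.features = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.Conv2d(channels[-1], 256, 1),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(256, num_classes))

    def forward(self, x):
        return self.head(self.features(x))

# A 224 x 224 time-frequency image in, two-class logits out.
print(CCViTSkeleton()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 2])
```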

3.2. CSAConv Module

The CSAConv module uses depthwise convolution as its main body; each convolution kernel processes only one channel, significantly reducing the computational complexity. The structure of the CSAConv module is shown in Figure 2, where H, W, and C denote the height, width, and channel of the tensor, respectively; k represents the convolutional kernel size; and s represents the stride. The input feature maps $IN_c$ first pass through the CSA module, which captures important channel information, to obtain the feature maps $G_c$. These then pass through the depthwise convolution [32], which mixes information only in the spatial dimension. The $i$-th feature map is calculated as follows:
$$V_i = G_i * k_i, \quad i = 1, 2, \ldots, C$$
where $*$ is the convolution operation: the $i$-th channel of the output feature map $V$ is calculated from the $i$-th channel of the feature map $G$ with the $i$-th convolution kernel. The kernel size $k$ is determined experimentally (Section 4.2). Then, two convolutional transform functions integrate the information in higher dimensions by first increasing and then reducing the channel dimension, extracting richer high-level features for the model. Finally, the feature maps are added to the input feature maps through a shortcut branch to obtain the output feature maps. The equation is shown as follows:
$$OUT_c = IN_c + f_c^{1}(\delta(f_{4c}^{1}(LN(V_c))))$$
where the inner convolution $f_{4c}^{1}$ expands the channel dimension from $C$ to $4C$; the outer convolution $f_c^{1}$ reduces it back to $C$; and $\delta$ is the sigmoid-weighted linear unit (SiLU) activation function. This effectively reduces the number of operations of the model.
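A minimal PyTorch sketch of this block, under our reading of the two equations above, is given below. The CSA module is stubbed with Identity here (a sketch of it follows in Section 3.4), and the channels-last placement of the LayerNorm is an assumption borrowed from ConvNeXt-style blocks:

```python
import torch
import torch.nn as nn

class CSAConv(nn.Module):
    def __init__(self, dim, k=3, csa=None):
        super().__init__()
        self.csa = csa if csa is not None else nn.Identity()  # CSA stub (Section 3.4)
        self.dwconv = nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)  # V_i = G_i * k_i, one kernel per channel
        self.norm = nn.LayerNorm(dim)           # the block's single LN
        self.expand = nn.Linear(dim, 4 * dim)   # f_{4c}^1: 1x1 conv as Linear in channels-last
        self.act = nn.SiLU()                    # the block's single activation
        self.reduce = nn.Linear(4 * dim, dim)   # f_c^1: back to C channels

    def forward(self, x):
        shortcut = x
        x = self.dwconv(self.csa(x))
        x = x.permute(0, 2, 3, 1)               # NCHW -> NHWC for LayerNorm
        x = self.reduce(self.act(self.expand(self.norm(x))))
        x = x.permute(0, 3, 1, 2)               # back to NCHW
        return shortcut + x                     # the shortcut branch: OUT = IN + ...

print(CSAConv(16)(torch.randn(1, 16, 56, 56)).shape)  # torch.Size([1, 16, 56, 56])
```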

3.3. CSAConvViT Module

The CSAConvViT module uses a ViT encoder as a feature extractor to capture global information through attention, addressing the problem that a CNN cannot make full use of contextual information. The structure of the CSAConvViT module is shown in Figure 3, where H, W, and C denote the height, width, and channel of the tensor, respectively; k represents the convolutional kernel size; s represents the stride; and n represents the number of transformer encoder stacking blocks. The input feature maps $IN_c$ first pass through the CSA module, which captures the channels of interest, to obtain the feature maps $G_c$. The module then extracts features locally using convolution, so the feature maps contain the inductive bias information unique to CNNs. The equation is shown as follows:
$$U_{tc}(h, w) = f_{tc}^{1}(f_c^{k}(IN_c(h, w)))$$
where the first convolution $f_c^{k}$ performs local feature fusion, and the second convolution $f_{tc}^{1}$ adjusts the number of channels to the $TC$ required by the transformer. Then, an Unfold -> Transformer -> Fold structure extracts more robust features from a global perspective. First, the feature maps $U_{tc}$ of size $H \times W \times TC$ are spread into $N$ patches by the Unfold operation; each patch's size is $h \times w$, so $P = h \times w$ and $N = HW/P$. Then, global feature extraction is performed by $n$ stacked transformer encoder modules. The equation is shown as follows:
$$L_{tc}(p) = \mathrm{Transformer}_n(U_{tc}(p))$$
where the value of $n$ is obtained through experimental investigation. Finally, the feature maps $L_{tc}$ are folded back to size $H \times W \times TC$ by the Fold operation, and a convolution completes the feature fusion. The equation is shown as follows:
$$M_c(h, w) = LN(f_c^{1}(L_{tc}(h, w)))$$
where $f_c^{1}$ is a 1 × 1 convolution that adjusts the number of channels back to $C$. The feature maps $M_c$ are then concatenated with the input feature maps $IN_c$ along the channel dimension, and feature fusion is performed by convolution. The equation is shown as follows:
$$OUT_c(h, w) = \delta(f_c^{k}(\mathrm{Concat}(M_c(h, w), IN_c(h, w))))$$
where $f_c^{k}$ is the convolution operation for local feature fusion, reducing the number of channels from $2C$ back to $C$, and $\delta$ is the SiLU activation function. The kernel size $k$ is determined through experimental testing. This structure gives the transformer the inductive bias unique to CNNs, which drastically reduces the number of parameters the transformer structure requires and alleviates the difficulty of training it.
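The sketch below is our reading of the four equations above, again with the CSA module stubbed. The 2 × 2 patch size, the transformer's head count and feed-forward width, and the use of GroupNorm as a channels-first stand-in for the LN are all assumptions; for simplicity, the sketch also attends over all tokens at once rather than using MobileViT's per-position attention:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CSAConvViT(nn.Module):
    def __init__(self, dim, tc, n_blocks, k=3, patch=2):
        super().__init__()
        self.patch = patch
        self.local = nn.Sequential(nn.Conv2d(dim, dim, k, padding=k // 2),  # f_c^k: local fusion
                                   nn.Conv2d(dim, tc, 1))                   # f_tc^1: adjust channels to TC
        layer = nn.TransformerEncoderLayer(d_model=tc, nhead=4,
                                           dim_feedforward=2 * tc, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_blocks)               # Transformer_n
        self.proj = nn.Conv2d(tc, dim, 1)                                   # f_c^1: back to C channels
        self.norm = nn.GroupNorm(1, dim)                                    # LN stand-in
        self.fuse = nn.Conv2d(2 * dim, dim, k, padding=k // 2)              # f_c^k over Concat(M, IN)

    def forward(self, x):
        inp, p = x, self.patch
        u = self.local(x)
        B, TC, H, W = u.shape
        # Unfold: split the map into non-overlapping p x p patches and flatten to tokens.
        t = F.unfold(u, p, stride=p)                                  # (B, TC*p*p, N)
        t = t.view(B, TC, p * p, -1).permute(0, 3, 2, 1).reshape(B, -1, TC)
        t = self.encoder(t)                                           # global self-attention
        # Fold: invert the token layout back to an H x W feature map.
        t = t.view(B, -1, p * p, TC).permute(0, 3, 2, 1).reshape(B, TC * p * p, -1)
        m = self.norm(self.proj(F.fold(t, (H, W), p, stride=p)))      # M = LN(f_c^1(L))
        return F.silu(self.fuse(torch.cat([m, inp], dim=1)))          # OUT = SiLU(f_c^k(Concat(M, IN)))

print(CSAConvViT(48, 64, 5)(torch.randn(1, 48, 14, 14)).shape)  # torch.Size([1, 48, 14, 14])
```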

3.4. CSA Attention Module

The basic principle of attention mechanisms draws on the phenomenon whereby humans focus on the important and interesting parts of incoming information. An attention mechanism makes the model focus on the relatively more important parts of the input features by assigning different weights to different feature parts. In order to accurately capture the inter-channel information required by the model, this paper proposes a new channel attention mechanism, named CSA. A structural diagram of the CSA module is shown in Figure 4, where H, W, and C denote the height, width, and channel of the tensor, respectively; k represents the convolution kernel size; s represents the stride; and r represents the scaling multiplier. The CSA module aggregates the input feature maps $IN_c$ along the height and width directions, respectively, to obtain a pair of feature maps with location information, which helps the network locate the target of interest more accurately. The equations are shown as follows:
$$X_c^h(h) = \frac{1}{W} \sum_{0 \le j < W} IN_c(h, j)$$
$$X_c^w(w) = \frac{1}{H} \sum_{0 \le i < H} IN_c(i, w)$$
where $X^h$ and $X^w$ are the feature maps obtained by average pooling in the height and width directions, respectively. Then, the height-direction feature map is transposed into the width direction and concatenated with the width-direction feature map, and feature fusion is carried out through a convolution operation. The equation is shown as follows:
$$Y_{c/r} = \delta(f_{c/r}^{1}(\mathrm{Concat}(X_c^h, X_c^w)))$$
where $\mathrm{Concat}$ is the concatenation operation in the width dimension, $f_{c/r}^{1}$ is the convolution operation that scales the channel dimension to $C/r$, and $\delta$ is the SiLU activation function. Then $Y_{c/r}$ is decomposed into two independent tensors, $Y^h$ and $Y^w$, which are separately transformed by convolutional transform functions to obtain the weights in the two directions. The equations are shown as follows:
$$Z_c^h = \sigma(f_c^{1}(Y_{c/r}^h))$$
$$Z_c^w = \sigma(f_c^{1}(Y_{c/r}^w))$$
where $f_c^{1}$ is the convolution operation that adjusts the channel dimension back to $C$, $\sigma$ is the sigmoid activation function, and $Z^h$ and $Z^w$ are the height- and width-direction weights, respectively. The module uses two 1 × 1 convolutions to integrate the inter-channel information, and the module parameters are reduced by first reducing the dimension and then increasing it. Finally, the weights in the two directions are multiplied with the input feature maps through a shortcut branch, and the weighted feature maps are multiplied with the input feature maps once more through a second shortcut branch, to enhance the features. The equation is shown as follows:
$$G_c(h, w) = IN_c(h, w) \times Z^h \times Z^w \times IN_c(h, w)$$
where $G_c$ are the output feature maps. The CSA module is placed in front of each module in the CCViT model, providing it with significant inter-channel information, which helps improve its channel feature capture capability.
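Below is a minimal sketch of the CSA module as we read the equations above: coordinate-attention-style directional pooling, followed by the distinctive double multiplication with the input in the final equation. The reduction ratio r = 4 is an assumed value:

```python
import torch
import torch.nn as nn

class CSA(nn.Module):
    def __init__(self, dim, r=4):
        super().__init__()
        self.squeeze = nn.Sequential(nn.Conv2d(dim, dim // r, 1), nn.SiLU())  # f_{c/r}^1 with SiLU
        self.h_proj = nn.Conv2d(dim // r, dim, 1)  # f_c^1, height branch
        self.w_proj = nn.Conv2d(dim // r, dim, 1)  # f_c^1, width branch

    def forward(self, x):
        B, C, H, W = x.shape
        xh = x.mean(dim=3, keepdim=True)   # X^h: average over width  -> (B, C, H, 1)
        xw = x.mean(dim=2, keepdim=True)   # X^w: average over height -> (B, C, 1, W)
        # Transpose the height map into the width direction and concatenate.
        y = self.squeeze(torch.cat([xh.permute(0, 1, 3, 2), xw], dim=3))
        yh, yw = y.split([H, W], dim=3)
        zh = torch.sigmoid(self.h_proj(yh.permute(0, 1, 3, 2)))  # Z^h: (B, C, H, 1)
        zw = torch.sigmoid(self.w_proj(yw))                      # Z^w: (B, C, 1, W)
        return x * zh * zw * x             # G = IN x Z^h x Z^w x IN: input enters via both shortcuts

print(CSA(16)(torch.randn(1, 16, 56, 56)).shape)  # torch.Size([1, 16, 56, 56])
```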

3.5. Micro Design

To address the problem that neural networks are challenging to train, normalization layers are frequently added to increase the capacity of the model to accommodate the data. Batch normalization (BN) [33], the most popular component in convolutional neural networks, normalizes within a single batch and improves the convergence of the network while reducing overfitting. However, BN has disadvantages that can affect model performance [34]. On the other hand, layer normalization (LN) [35] normalizes the features of each sample at once, independent of the batch, with a more straightforward architecture, making it the preferred option in NLP. The principle of LN is shown as follows:
$$LN = \frac{x - E(x)}{\sqrt{Var(x) + \varepsilon}} \times \gamma + \beta$$
where $\varepsilon$ is a tiny number used to prevent the denominator from being 0, and $\gamma$ and $\beta$ default to 1 and 0, respectively, and are learnable. The CCViT model developed in this study abandons the widely used BN in favor of uniform normalization with LN. The model significantly reduces the number of normalization layers: only one LN is used in each CSAConv module and CSAConvViT module, lowering the model's FLOPs.
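A quick numeric check of the formula above (ours): manual normalization over the feature dimension matches PyTorch's nn.LayerNorm with its defaults γ = 1, β = 0, and ε = 1e-5:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 8)  # (batch, features)
manual = (x - x.mean(-1, keepdim=True)) / torch.sqrt(x.var(-1, unbiased=False, keepdim=True) + 1e-5)
print(torch.allclose(manual, nn.LayerNorm(8)(x), atol=1e-6))  # True
```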
As networks have developed, various activation functions have been proposed. Sigmoid and its combinatorial functions [36] are often used in classifiers but perform well only there. The hidden layers of deep convolutional neural networks usually use the rectified linear unit (ReLU) [37] to avoid gradient saturation and speed up the convergence of stochastic gradient descent. However, ReLU has the serious drawback that a neuron enters a dead zone and can no longer be activated once its input falls below 0. Most models in NLP use the Gaussian error linear unit (GeLU) [38] because it is connected to stochastic regularization, which raises the output probability of the neuron; moreover, model performance with GeLU is comparable to that with ReLU in the image classification domain. A newer activation function, proposed for reinforcement learning, is the SiLU function [39], a linear unit weighted by the sigmoid function. The principle of the SiLU function is shown as follows:
$$\delta = \frac{x}{1 + e^{-x}}$$
The SiLU function acts as an implicit regularizer with significant self-stability that can successfully inhibit the learning of overly large weights, preventing network overfitting. The YOLOv7 [40] model applied the SiLU activation function and produced superior results. In this study, the SiLU function is used as the activation function of the CCViT model. The model uses few activation functions, only one SiLU in each CSAConv module and CSAConvViT module, contributing to the better performance of the CCViT model.
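As a one-line check (ours), the formula above is exactly x weighted by its own sigmoid, matching PyTorch's built-in SiLU; unlike ReLU, it stays non-zero and differentiable for negative inputs:

```python
import torch
import torch.nn as nn

x = torch.linspace(-4, 4, 9)
print(torch.allclose(x * torch.sigmoid(x), nn.SiLU()(x)))  # True: SiLU(x) = x / (1 + e^{-x})
```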

3.6. Data Pre-Processing Process

In order to improve the robustness of the model and reduce its sensitivity to noise, this paper used real monitoring data to train the model. The data are from a coal mine working face in Shandong Province with an average coal seam thickness of 7.03 m. The coal seam occurrence is simple and stable, and mining mainly adopts heavily mechanized, high-efficiency methods. The hydrogeological type of the mine is complex. The mine is a rock burst mine, and the coal seam is a rock burst coal seam. The coal mining method is strike longwall mining, and excavation combines fully mechanized driving with drill-and-blast driving, so a large number of microseismic events and blasting events are generated during mining. In order to obtain accurate blasting and microseismic events, blasting operations in the mine were tracked and recorded. The recorded blasting waveforms were time calibrated with P and S waves and located to the blasting source; this was then compared with the recorded blasting time and location to confirm whether a waveform was the signal of the recorded blasting operation. Thus, a definitive database of blasting events was obtained. At present, no method can completely and accurately identify microseismic events in mine areas. This paper first eliminates background noise and useless data, excludes events within the blasting time, and finally analyzes the waveform characteristics of microseismic events, to build a microseismic event database using a validated method.
This paper selected microseismic and blasting events that occurred in a coal mine in Shandong Province from 1 May 2020 to 14 April 2021 to form the dataset. The dataset has 4000 single-channel data records, containing 2883 coal rock rupture microseismic signals and 1117 blasting vibration signals. Timing diagrams of the two types of signal are shown in Figure 5.
The lengths and magnitudes in the dataset were unified: all data were 1344 samples long and were normalized into the range [−1, 1]. Let a coal rock rupture microseismic signal or blasting vibration signal be $x(t)$, $t = 1, 2, \ldots, 1344$. The absolute maximum eigenvalue of the signal is calculated according to the following equation:
$$X_{max} = \max(\mathrm{abs}(x(t)))$$
where $X_{max}$ is the absolute maximum eigenvalue of signal $x(t)$. The signal normalization process is shown as follows:
$$X(t) = x(t) / X_{max}$$
where $X(t)$ is the result of normalizing signal $x(t)$. A timing diagram of the two kinds of signal after data normalization is shown in Figure 6.
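A minimal sketch of this normalization step (ours), assuming each record is a NumPy array of 1344 samples:

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    """Divide by the absolute maximum: X(t) = x(t) / max(abs(x(t)))."""
    return x / np.max(np.abs(x))

x = np.random.randn(1344)               # stand-in for one recorded signal
X = normalize(x)
print(X.min() >= -1.0, X.max() <= 1.0)  # True True: compressed into [-1, 1]
```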
This paper adopts a multi-scale wavelet transform to extract the time–frequency diagram of the two types of signal. The time–frequency diagram contains the time and frequency domain information of the signal, indicating how the frequency and amplitude of the signal change with time, which helps to classify the signal accurately. This paper chose the Cmor (complex Morlet) wavelet as the basis for transforming the signal and obtained the time–frequency diagrams shown in Figure 7, where the horizontal axis indicates time, the vertical axis indicates frequency, and the color indicates amplitude. Finally, the time–frequency diagram was resized to 224 × 224 as input to the CCViT model.
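A sketch of the time–frequency extraction using PyWavelets' complex Morlet ("cmor") basis; the sampling rate, the bandwidth/center-frequency parameters, and the scale range are our assumptions, since the paper does not specify them:

```python
import numpy as np
import pywt
import matplotlib.pyplot as plt

fs = 1000                                         # assumed sampling rate (Hz)
t = np.arange(1344) / fs
x = np.sin(2 * np.pi * 40 * t) * np.exp(-3 * t)   # synthetic stand-in for a normalized signal

scales = np.arange(1, 128)                        # assumed multi-scale range
coeffs, freqs = pywt.cwt(x, scales, 'cmor1.5-1.0', sampling_period=1 / fs)

plt.pcolormesh(t, freqs, np.abs(coeffs), shading='auto')  # color encodes amplitude
plt.xlabel('Time (s)'); plt.ylabel('Frequency (Hz)')
plt.savefig('tf_diagram.png')  # resized to 224 x 224 before being fed to the CCViT model
```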

4. Experiments

4.1. Parameter Settings

In order to study the effectiveness of the CCViT model in distinguishing microseismic signals, this paper divided the dataset into training, validation, and test sets at a ratio of 6:2:2; that is, the model was trained on 2400 samples, tuned on 800 samples, and evaluated on 800 samples. The CCViT model was trained for 800 epochs from scratch, using only the basic data augmentation strategies of random cropping and horizontal flipping. The batch size was 8. The AdamW optimizer [41] was used to update the network weights, with an initial learning rate of 0.0002 and an L2 weight decay of 0.01. The cross-entropy loss function with label smoothing was used, with a smoothing rate of 0.1. This paper evaluated the model using FLOPs, the number of parameters, and accuracy. FLOPs (floating point operations) measure the complexity of a model: when a model performs forward propagation, operations such as convolution, pooling, layer normalization, and activation functions consume corresponding computational power. The number of parameters describes the model size, in other words, the memory the model requires, independent of the size of the input features. The FLOPs and the number of parameters were calculated using functions from an open-source Python library. Accuracy was used to evaluate the classification effectiveness of the model on the test set. In order to appraise the performance of the model more accurately, the model was trained five times, and the average of the five results was used as the final result.
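The training configuration above translates directly into PyTorch; the sketch below (ours) uses a placeholder model, since only the hyperparameters are being illustrated:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 2))  # placeholder for CCViT
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)              # label-smoothed cross-entropy

x = torch.randn(8, 3, 224, 224)  # one batch of 8 time-frequency images
y = torch.randint(0, 2, (8,))    # 0 = blasting, 1 = microseismic (label coding assumed)
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```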

4.2. Experiments on Adjusting the Model

This paper investigated the effect of the number of transformer encoder stacking blocks on model performance, in order to tune the model. Comparison experiments were run with different numbers of stacking blocks, and the results are shown in Table 2. When the numbers of stacked blocks of the transformer encoder were set to (5, 3), the model performance was best. When the number of stacked blocks in the third or fourth stage was increased, the FLOPs and parameters of the model grew by roughly the same amounts, but the model was more stable when the number of stacked blocks in the third stage was increased. Therefore, the center of attention of the model is the third stage, and a moderate increase in the number of stacked blocks there helps improve the performance of the CCViT model.
Choosing the right kernel size also helps improve model performance. In order to investigate the effect of different convolution kernel sizes on the CCViT model, this paper selected suitable sizes through repeated experiments. First, with the remaining convolutional kernel sizes defaulting to 3, the depthwise convolution kernels were set to different sizes for comparison experiments. After selecting an appropriate kernel size for the depthwise convolution, a suitable kernel size for the convolution layers in the latter two stages was chosen through comparison experiments. The results are shown in Table 3. As the convolution kernel size increased, the computation of the CCViT model increased and its performance decreased; the best performance was achieved with a 3 × 3 convolutional kernel. The conclusion is that, for the microseismic signal classification task, a larger convolution kernel is not better; using small convolution kernels instead helps the CCViT model extract the features of microseismic signals.

4.3. Experiments on the CSA Attention Module

Where the CSA module is placed affects the CCViT model’s performance. In order to evaluate the effect of the CSA module position on the model, this paper conducted a series of experiments to explore the following:
  • Insert at the top of the module in each stage, as shown in Figure 8a;
  • Insert in the middle of the module in each stage, as shown in Figure 8b;
  • Insert at the bottom of the module in each stage, as shown in Figure 8c.
The experimental results are shown in Table 4. The results show that the model performed best when the CSA module was placed at the top of the module in each stage: the model preferentially uses the CSA module to extract important channel information, which helps it capture the relevant features.
In order to evaluate the effectiveness of the proposed CSA module, this paper inserted the SE, CBAM, and CA attention modules at the same positions in the model for comparison experiments. The results are shown in Table 5. The CSA module and the other attention modules increased the FLOPs and parameters by similarly small amounts. In terms of performance, adding the other attention modules degraded the model or left it essentially unchanged; these attention mechanisms do not suit the extraction of microseismic signal features. Adding the CSA module, in contrast, improved the model performance: the feature information it extracts is better suited to the microseismic signal classification task.

4.4. Comparison Experiments

In order to evaluate the efficacy of the CCViT model, this paper compared it against the TFMC model [16], which is also used for microseismic signal classification, against heavyweight networks (ResNet, Vision Transformer, and ConvNext), and against lightweight networks (MobileNetv2, MobileViT), each at its smallest configuration, in terms of parameter count, FLOPs, and accuracy. The experimental results are shown in Table 6. The proposed model outperformed the other deep learning methods on a dataset of real monitoring data, with an accuracy of 96.18%, and it offers low computational cost together with high accuracy. Compared to the heavyweight models, the CCViT model achieved better performance with a small number of parameters; for example, it has only 2% of the parameters and FLOPs of ConvNext. This indicates that extracting features from multiple perspectives makes it easier for the model to capture features, which benefits signal recognition. Compared with the lightweight models, the proposed CCViT model used fewer parameters, had fewer FLOPs, and was easier to train; for example, it has 60% of the parameters and 33% of the FLOPs of MobileViT. The TFMC model performed poorly on our dataset, with an accuracy of only 68.25%, and its parameters and FLOPs were much larger than those of the CCViT model; this confirms that simple convolutional models struggle to distinguish between microseismic and blasting signals. The comparison experiments illustrate that the CCViT model is better suited to the microseismic signal recognition task, with high classification efficiency and better performance.

5. Conclusions

This paper proposed a new lightweight model, named CCViT, for fast and accurate discrimination between coal rock rupture microseismic signals and blasting vibration signals. First, the CCViT model combines a CNN and transformer to capture signal features from multiple perspectives, significantly reducing the number of parameters and the computation of the model while ensuring its performance. Second, this paper proposed a new channel attention, named CSA, to enhance the important features between channels. The experimental results showed that the proposed method outperformed other advanced methods in performance, required fewer FLOPs, and could identify microseismic signals quickly. The CCViT model has low sensitivity to noise and can extract time-domain and frequency-domain features simultaneously, which addresses the practical engineering problem of distinguishing microseismic signals from blasting signals, and it has good prospects for application in microseismic monitoring in coal mine engineering. We believe that this approach of extracting signal features from multiple angles in the time–frequency spectrogram of a signal is highly efficient and can be extended to other fields driven by traditional time series data. This paper uses only single-channel data for each event and therefore cannot fully exploit the available information; in future work, we will try to combine data from multiple channels to identify microseismic events.

Author Contributions

Conceptualization, X.Z., X.W., Z.Z. and Z.W.; methodology, X.Z. and X.W.; software, X.W. and Z.Z.; validation, X.Z., X.W., Z.Z. and Z.W.; formal analysis, X.Z. and Z.W.; investigation, X.Z., X.W., Z.Z. and Z.W.; resources, X.Z., X.W., Z.Z. and Z.W.; data curation, X.Z., X.W., Z.Z. and Z.W.; writing—original draft preparation, X.Z. and X.W.; writing—review and editing, X.Z., X.W., Z.Z. and Z.W.; visualization, X.W. and Z.Z.; supervision, X.Z. and Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

The research was supported by the National Natural Science Foundation of China (grant no. 51904173), and the Natural Science Foundation of Shandong Province (grant no. ZR2022ME091).

Data Availability Statement

The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CNN: Convolutional Neural Network
CSA: The channel attention mechanism proposed in this paper
CCViT: The model proposed in this paper
FLOPs: Floating point operations
EMD: Empirical Mode Decomposition
VMD: Variational Modal Decomposition
SVM: Support Vector Machine
SVD: Singular Value Decomposition
LMD: Local Mean Decomposition
GMM-HMM: Gaussian Mixed Hidden Markov Model
MFCC: Mel-Frequency Cepstral Coefficients
ViT: Vision Transformer
SiLU: Sigmoid-weighted Linear Units
BN: Batch Normalization
LN: Layer Normalization
ReLU: Rectified Linear Unit
GeLU: Gaussian Error Linear Unit

References

  1. Liu, X.; Tang, C.A.; Li, L.; Lv, P.; Liu, H. Microseismic monitoring and 3D finite element analysis of the right bank slope, Dagangshan hydropower station, during reservoir impounding. Rock Mech. Rock Eng. 2017, 50, 1901–1917. [Google Scholar] [CrossRef]
  2. Wang, J.; Prabhat, B.; Shakil, M. Review of machine learning and deep learning application in mine microseismic event classification. Min. Miner. Depos. 2021, 15, 19–26. [Google Scholar]
  3. Ma, J.; Zhao, G.; Dong, L.; Chen, G.; Zhang, C. A comparison of mine seismic discriminators based on features of source parameters to waveform characteristics. Shock. Vib. 2015, 2015, 919143. [Google Scholar] [CrossRef]
  4. Zhao, G.Y.; Ju, M.A.; Dong, L.J.; Li, X.B.; Chen, G.H.; Zhang, C.X. Classification of mine blasts and microseismic events using starting-up features in seismograms. Trans. Nonferrous Met. Soc. China 2015, 25, 3410–3420. [Google Scholar] [CrossRef]
  5. Lu, C.; Dou, L.; Wu, X.; Wang, H.M.; Qin, Y.H. Frequency spectrum analysis on microseismic monitoring and signal differentiation of rock material. CJGE 2005, 27, 772–775. [Google Scholar]
  6. Tang, S.F.; Tong, M.M.; Pan, Y.X.; He, X.; Lai, X.S. Energy spectrum coefficient analysis of wavelet features for coal rupture microseismic signal. Chin. J. Sci. Instrum. 2011, 32, 1522–1527. [Google Scholar]
  7. Huang, N.E.; Shen, Z.; Long, S.R.; Wu, M.C.; Shih, H.H.; Zheng, Q.; Liu, H.H. The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. Proc. Roy. Soc. 1998, 454, 903–995. [Google Scholar] [CrossRef]
  8. Dragomiretskiy, K.; Zosso, D. Variational mode decomposition. IEEE Trans. Signal Proces. 2013, 62, 531–544. [Google Scholar] [CrossRef]
  9. Zhang, X.L.; Jia, R.S.; Lu, X.M.; Peng, Y.J.; Zhao, W.D. Identification of blasting vibration and coal-rock fracturing microseismic signals. Appl. Geophys. 2018, 15, 280–289. [Google Scholar] [CrossRef]
  10. Zhu, Q.; Jiang, F.; Ming, Y.; Xing, Y.; Lin, W. Classification of mine microseismic events based on wavelet-fractal method and pattern recognition. CJGE 2012, 34, 2036–2042. [Google Scholar]
  11. Shang, X.Y.; Li, X.B.; Peng, K.; Dong, L.J.; Wang, Z.W. Feature extraction and classification of mine microseism and blast based on EMD_SVD. CJGE 2016, 38, 1849–1858. [Google Scholar]
  12. Li, W. Feature extraction and classification method of mine microseismic signals based on lmd and pattern recognition. J. China Coal Soc. 2017, 42, 1156–1164. [Google Scholar]
  13. Peng, P.; He, Z.; Wang, L. Automatic classification of microseismic signals based on MFCC and GMM-HMM in underground mines. Shock. Vib. 2019, 2019, 5803184. [Google Scholar] [CrossRef]
  14. Zhang, X.; Zhao, Z.; Jia, R.; Cao, L. Identification of Microseismic Signals Based on Multiscale Singular Spectrum Entropy. Shock. Vib. 2020, 2020, 6717128. [Google Scholar] [CrossRef]
  15. Peng, P.; He, Z.; Wang, L.; Jiang, Y. Automatic Classification of Microseismic Records in Underground Mining: A Deep Learning Approach. IEEE Access. 2020, 8, 17863–17876. [Google Scholar] [CrossRef]
  16. Ma, C.; Ran, X.; Xu, W.; Yan, W.; Li, T.; Dai, K.; Wan, J.; Lin, Y.; Tong, K. Fine Classification Method for Massive Microseismic Signals Based on Short-Time Fourier Transform and Deep Learning. Remote Sens. 2023, 15, 502. [Google Scholar] [CrossRef]
  17. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26–29 June 2016; pp. 770–778. [Google Scholar]
  18. Tang, S.; Wang, J.; Tang, C. Identification of microseismic events in rock engineering by a convolutional neural network combined with an attention mechanism. Rock Mech. Rock Eng. 2021, 54, 47–69. [Google Scholar] [CrossRef]
  19. Ma, C.; Zhang, H.; Lu, X.; Ji, X.; Li, T.; Fang, Y.; Ran, X. A novel microseismic classification model based on bimodal neurons in an artificial neural network. Tunn. Undergr. Space Technol. 2023, 131, 104791. [Google Scholar] [CrossRef]
  20. Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. In Proceedings of the International Conference on Learning Representations, Online, 25–29 April 2022. [Google Scholar]
  21. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  22. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  23. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
  24. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Adam, H. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  25. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Polosukhin, I. Attention is all you need. NIPS 2017, 30, 5998–6008. [Google Scholar]
  27. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  28. Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. NIPS 2015, 28, 2017–2025. [Google Scholar]
  29. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  30. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  31. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Kuala Lumpur, Malaysia, 18–20 December 2021; pp. 13713–13722. [Google Scholar]
  32. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 11976–11986. [Google Scholar]
  33. Ioffe, S. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. NIPS 2017, 30, 1–9. [Google Scholar]
  34. Wu, Y.; Johnson, J. Rethinking “batch” in batchnorm. arXiv 2021, arXiv:2105.07576. [Google Scholar]
  35. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  36. Han, J.; Moraga, C. The influence of the sigmoid function parameters on the speed of backpropagation learning. In Proceedings of the from Natural to Artificial Neural Computation: International Workshop on Artificial Neural Networks, Malaga-Torremolinos, Spain, 7–9 June 1995; pp. 195–201. [Google Scholar]
  37. Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
  38. Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  39. Elfwing, S.; Uchibe, E.; Doya, K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Netw. 2018, 107, 3–11. [Google Scholar] [CrossRef]
  40. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
  41. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
Figure 1. Overall structure diagram of the CCViT model.
Figure 2. CSAConv module structure diagram.
Figure 3. CSAConvViT module structure diagram.
Figure 4. CSA attention module structure diagram.
Figure 5. Typical signals for constructing a dataset: (a) Microseismic signals, (b) blasting signal.
Figure 6. Typical signal after data normalization: (a) Microseismic signals, (b) blasting signal.
Figure 7. Time–frequency diagrams of typical signals: (a) Microseismic signals, (b) blasting signal. The horizontal axis indicates time, the vertical axis indicates frequency, and the color indicates amplitude: the yellower the color, the larger the amplitude; the bluer the color, the smaller the amplitude.
Figure 8. Insert position of the CSA module: (a) At the top of the module in each stage. (b) In the middle of the module in each stage. (c) At the bottom of the module in each stage.
Table 1. Specific settings of the internal parameters of the CCViT model.

Layer Name | Input Size | Output Size | Parameter Setting
Pool 1 | 224 × 224 | 56 × 56 | 4 × 4, 16, stride 4
CSAConv 1 | 56 × 56 | 56 × 56 | {3 × 3, 16, stride 1, group = 56; 1 × 1, 64, stride 1; 1 × 1, 16, stride 1} × 2
Pool 2 | 56 × 56 | 28 × 28 | 2 × 2, 24, stride 2
CSAConv 2 | 28 × 28 | 28 × 28 | {3 × 3, 24, stride 1, group = 28; 1 × 1, 96, stride 1; 1 × 1, 24, stride 1} × 2
Pool 3 | 28 × 28 | 14 × 14 | 2 × 2, 48, stride 2
CSAConvViT 1 | 14 × 14 | 14 × 14 | {3 × 3, 48, stride 1; 1 × 1, 64, stride 1; Transformer × 5; 1 × 1, 48, stride 1; 3 × 3, 48, stride 1} × 1
Pool 4 | 14 × 14 | 7 × 7 | 2 × 2, 64, stride 2
CSAConvViT 2 | 7 × 7 | 7 × 7 | {3 × 3, 64, stride 1; 1 × 1, 80, stride 1; Transformer × 3; 1 × 1, 64, stride 1; 3 × 3, 64, stride 1} × 1
Conv | 7 × 7 | 7 × 7 | 1 × 1, 256, stride 1
Output | 7 × 7 | 1 × 1 | Global average pool, Linear
Table 2. Research on the number of stacked blocks of the transformer encoder.

Transformer Blocks | Parameters | FLOPs | Accuracy
(3, 3) | 0.50 M | 82.06 M | 95.03%
(4, 3) | 0.54 M | 89.03 M | 95.35%
(4, 4) | 0.59 M | 92.02 M | 95.48%
(5, 3) | 0.57 M | 95.33 M | 95.58%
(5, 4) | 0.62 M | 98.66 M | 95.43%
(5, 5) | 0.67 M | 101.99 M | 95.28%
Table 3. Comparative experiment on the size of the convolution kernel.

Layer | Kernel Size | Parameters | FLOPs | Accuracy
DWConv | 3 | 0.56 M | 89.81 M | 95.90%
DWConv | 5 | 0.57 M | 92.02 M | 95.68%
DWConv | 7 | 0.57 M | 95.33 M | 95.58%
DWConv | 9 | 0.57 M | 99.75 M | 95.28%
Conv | 3 | 0.56 M | 89.81 M | 95.90%
Conv | 5 | 0.87 M | 121.12 M | 95.85%
Conv | 7 | 1.33 M | 168.09 M | 95.68%
Conv | 9 | 1.95 M | 230.71 M | 95.63%
Table 4. Research on the insertion position of CSA modules.

Position | Parameters | FLOPs | Accuracy
At the top of the module | 0.57 M | 90.15 M | 96.18%
In the middle of the module | 0.57 M | 90.15 M | 95.75%
At the bottom of the module | 0.57 M | 90.15 M | 95.93%
Table 5. Comparison with other attention modules.

Module | Parameters | FLOPs | Accuracy
SE | 0.57 M (+0.01) | 89.96 M (+0.15) | 95.45% (−0.45)
CA | 0.57 M (+0.01) | 90.15 M (+0.34) | 95.83% (−0.07)
CBAM | 0.57 M (+0.01) | 90.75 M (+0.95) | 95.93% (+0.03)
CSA (ours) | 0.57 M (+0.01) | 90.15 M (+0.34) | 96.18% (+0.28)
Table 6. Comparison with other models.

Model | Parameters | FLOPs | Accuracy
ResNet | 21.29 M | 3678.23 M | 94.00%
MobileNetv2 | 2.23 M | 326.27 M | 85.80%
Vision Transformer | 86.24 M | 16,863.45 M | 91.63%
ConvNext | 27.80 M | 4454.77 M | 95.48%
MobileViT | 0.95 M | 273.35 M | 95.58%
TFMC | 9.52 M | 3533.92 M | 68.25%
CCViT (ours) | 0.57 M | 90.15 M | 96.18%