1. Introduction
The development of automatic modulation recognition technology has always been a crucial focus in modern military communications [1,2]. This technology is vital in identifying enemy jamming signals and important military information during electronic surveillance and countermeasures operations [3]. By utilizing this technology, military forces can effectively devise targeted strategies for reconnaissance and counter-reconnaissance missions [4].
In the field of traditional modulation recognition, there are two main categories of methods: algorithms based on likelihood functions [5,6] and algorithms based on signal features [7,8]. Likelihood-function-based algorithms require accurate modeling of the unknown signal, treating modulation recognition as a multiple-hypothesis testing problem. However, these algorithms rely heavily on prior information and demand a high-precision, computationally complex likelihood model. Feature-based approaches, on the other hand, extract signal features and employ classifiers for modulation recognition. While this method offers lower computational complexity and greater practicality than likelihood-based approaches, it requires manual feature extraction and classifier design, so recognition outcomes are heavily influenced by expert experience in feature extraction.
Compared with traditional modulation recognition methods, which are limited by manual feature analysis, complex algorithms, and large computational costs [9], deep learning (DL)-based AMR technology offers new solutions to problems that traditional methods cannot overcome [10]. The advantage of DL-based modulation recognition is that it automatically extracts signal features and obtains a classifier by training a neural network, without the complex and difficult manual design of features and classifiers, and it achieves higher recognition accuracy than traditional methods. One of the most common approaches is to process the signal into in-phase (I) and quadrature (Q) components and extract features from them with deep networks. S. Hong and his team introduced a DL-powered AMR system tailored for detecting signals in OFDM systems, using convolutional neural networks to analyze IQ samples of OFDM signals [11]. Their experiments showed high classification accuracy at a signal-to-noise ratio of 10 dB; however, as the signal-to-noise ratio (SNR) dropped below 10 dB, recognition accuracy declined sharply. Fei and colleagues [12] took a different approach, developing the CLRD model, a dual-channel fusion network that captures temporal and spatial features of the signal simultaneously and proved highly efficient in signal recognition. Wang X et al. [13] proposed a multi-stream neural network designed to extract features from various aspects of modulation signals, such as amplitude, phase, frequency, and raw data; despite its innovative design, the recognition rate for certain modulation types fell short of expectations. Meanwhile, Lin S. and his team [14] combined a convolutional neural network with a time–frequency attention mechanism to extract crucial features from signals.
To address the problem of poor modulation recognition in environments with low SNRs, an automatic modulation recognition model leveraging deep learning techniques has been developed. This model combines a convolutional neural network (CNN), a bidirectional long short-term memory (BiLSTM) network, a deep neural network (DNN), and an attention mechanism. The process begins with the preprocessing of the original IQ signal using a signal processing (SP) module, which effectively suppresses additive white Gaussian noise. Subsequently, the preprocessed I/Q multi-channel signals are separated into independent I and Q data streams. These three input data streams are then fed into the CNN to capture both multi-channel and single-channel spatial characteristics of the I/Q signals. The incorporation of a spatial attention mechanism ensures that the model focuses on crucial spatial regions while disregarding irrelevant ones. The BiLSTM network then extracts temporal features, allowing the model to capture dependencies and, through a time attention mechanism, prioritize different parts of the sequence effectively. Finally, a fully connected layer identifies the modulation mode. To evaluate the performance of the CNN-BiLSTM-DNN framework under challenging low-SNR conditions, extensive testing was conducted on the RML2016.10b and RML2016.10a public datasets.
The main contributions of the CNN-BiLSTM-DNN framework proposed in this paper are as follows:
We combine a CNN, BiLSTM, a DNN, and an attention mechanism in a hybrid neural network architecture, leveraging their complementarity and synergy for extracting and classifying spatiotemporal features. The CNN learns the spatial features of the I/Q signals; the BiLSTM network extracts bidirectional time series features in the time dimension while effectively avoiding gradient explosion and gradient vanishing; and fully connected (FC) deep neural networks achieve effective feature classification.
The signal preprocessing (SP) module is used to process the original I/Q signal, which effectively filters out additive white Gaussian noise and lays a solid foundation for subsequent feature extraction.
By including the attention mechanism module in the model, we improve the model's representational capability, minimize the interference caused by invalid targets, and enhance recognition of the target of interest, ultimately improving the model's overall performance.
The structure of this study is as follows: Section 2 introduces the signal model and the signal preprocessing methods used in this work. Section 3 presents the structure of the proposed convolutional bidirectional long short-term memory deep neural network, along with a detailed explanation of each module's makeup and purpose. Section 4 describes and analyzes the experimental setup and results. Section 5 concludes with a summary of the study and discusses the advantages and disadvantages of the proposed approach.
3. AMR Framework
A framework of CNN-BiLSTM-DNN is proposed, as shown in Figure 1. The CNN-BiLSTM-DNN model is composed of a CNN, BiLSTM, an attention mechanism, and a fully connected (FC) deep neural network, which extract and classify spatiotemporal features through their complementarity and synergy. Incorporating an attention mechanism greatly enhances the model's ability to represent data, effectively filtering out irrelevant information and improving recognition accuracy for the target of interest, which leads to a significant boost in overall model performance. To eliminate noise, the input signal is first filtered by a minimum pooling layer and an average pooling layer. The I signal and the Q signal then pass through two one-dimensional convolution layers, respectively; the spliced signal passes through a two-dimensional convolution layer, is fused with the I/Q signal, and a final two-dimensional convolution layer extracts the spatial features of the signal. Two bidirectional LSTM layers then extract the signal's temporal characteristics, and the fully connected layer outputs the classification results. Three-channel input, spatial feature mapping, temporal feature extraction, and fully connected classification make up the functional parts of the model.
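As an illustration of the denoising front end described above, the following is a minimal NumPy sketch of the minimum-pooling-then-average-pooling filter. The window size of 2 and the stride-1, length-preserving behavior are assumptions for illustration; the paper's exact SP hyperparameters are not reproduced here.

```python
import numpy as np

def sp_denoise(iq, pool=2):
    """Hypothetical sketch of the SP module: a minimum-pooling pass
    followed by an average-pooling pass over each channel to suppress
    additive white Gaussian noise (window size is an assumption)."""
    out = np.empty_like(iq)
    for c in range(iq.shape[0]):
        x = iq[c]
        # min pooling with stride 1 and edge padding keeps the length
        padded = np.pad(x, (0, pool - 1), mode="edge")
        mins = np.array([padded[i:i + pool].min() for i in range(len(x))])
        # average pooling smooths the min-filtered signal
        padded = np.pad(mins, (0, pool - 1), mode="edge")
        out[c] = np.array([padded[i:i + pool].mean() for i in range(len(x))])
    return out

iq = np.random.randn(2, 128)  # I and Q channels, 128 samples each
clean = sp_denoise(iq)
```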
3.1. CNN
As a feedforward neural network, the CNN [17] is composed of multi-layer neural networks, and its neurons respond to the partial coverage of surrounding units, giving it obvious advantages in local feature extraction. CNNs consist mainly of convolutional layers and pooling layers. The convolutional layer, the core component of the CNN, uses learnable convolutional kernels to extract the spatial features of the input signal. The convolution operation can be represented as follows:
y = x ∗ W + b
where y represents the convolution output, x is the input matrix, W is the weight matrix of size k × k, ∗ denotes the convolution operation, and b is the bias.
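A direct NumPy rendering of this operation for a single channel in "valid" mode (cross-correlation form, as used by most DL frameworks):

```python
import numpy as np

def conv2d(x, w, b):
    """Single-channel 'valid' convolution:
    y(i, j) = sum_{m,n} w(m, n) * x(i + m, j + n) + b."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            y[i, j] = np.sum(x[i:i + kh, j:j + kw] * w) + b
    return y
```

For example, sliding a 2×2 all-ones kernel over a 4×4 input produces a 3×3 output whose entries are the sums of each 2×2 window plus the bias.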
Two 1D convolutional layers (Conv2 and Conv3) and three 2D convolutional layers (Conv1, Conv4, and Conv5) make up the spatial feature extraction module, which refines features for BiLSTM by reducing noise and abstracting input data at a higher level. Initially, the I/Q multiplex signal is preprocessed and split into distinct I-channel and Q-channel data streams. These three data streams are then individually processed by Conv1, Conv2, and Conv3 to capture both multi-channel and single-channel characteristics of the I/Q signal. To maintain data integrity during modeling, Conv2 and Conv3 utilize zero padding. The outputs are then combined in Concatenate2 before being fed into Conv5 for spatial feature extraction. This multi-channel input structure effectively captures representation features at various scales and maximizes the utilization of information from I-channel, Q-channel, and I/Q multi-channel data.
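The length-preserving zero padding used by Conv2 and Conv3 can be sketched as follows; the kernel values are placeholders, and the single-filter form is a simplification of the actual multi-filter layers:

```python
import numpy as np

def conv1d_same(x, w):
    """1D convolution with zero padding ('same'), so the I- and Q-stream
    feature lengths match for concatenation downstream."""
    k = len(w)
    pad_l = (k - 1) // 2
    pad_r = k - 1 - pad_l
    xp = np.pad(x, (pad_l, pad_r))  # zero padding on both ends
    return np.array([np.dot(xp[i:i + k], w) for i in range(len(x))])
```

The output always has the same number of samples as the input, which is what allows Concatenate2 to stack the I, Q, and I/Q feature maps without cropping.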
3.2. BiLSTM
Recurrent neural networks (RNNs) have a special variation known as long short-term memory networks (LSTMs) [18], which include memory units, forget gates, input gates, and output gates, as shown in Figure 2. These elements work together to selectively store, retrieve, and discard information within the network. The input gate regulates which input data are stored in the memory unit, the output gate manages the information flow from the memory cell to external sources, and the forget gate determines whether certain information should be retained or discarded. The specific calculation process of LSTM is as follows:
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
c̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t ⊙ tanh(c_t)
In these equations, the outputs of the forget gate, input gate, and output gate are denoted by f_t, i_t, and o_t, respectively. c_{t−1} and c_t represent the cell states at the previous time step and the current time step, respectively, and c̃_t is the candidate cell state. h_t denotes the final output value, and σ is the sigmoid function. W_f, W_i, W_o, and W_c are the weight matrices of the forget, input, and output gates and the current time step cell state, respectively; b_f, b_i, b_o, and b_c represent the corresponding bias terms. [h_{t−1}, x_t] signifies a vector composed of the output value from the previous time step and the input at the current time step.
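These gate computations can be sketched directly in NumPy. Packing the four weight matrices into one matrix W (and the four biases into one vector b) is an implementation convenience assumed here, not part of the paper's formulation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W packs [W_f; W_i; W_c; W_o] acting on the
    concatenated vector [h_prev, x_t]; b packs the four bias terms."""
    H = h_prev.shape[0]
    v = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    z = W @ v + b
    f = sigmoid(z[:H])                  # forget gate
    i = sigmoid(z[H:2 * H])             # input gate
    c_tilde = np.tanh(z[2 * H:3 * H])   # candidate cell state
    o = sigmoid(z[3 * H:])              # output gate
    c = f * c_prev + i * c_tilde        # new cell state
    h = o * np.tanh(c)                  # new hidden state / output
    return h, c
```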
The network architecture of the BiLSTM, which consists of two separate LSTMs, is depicted in Figure 3. The input sequence is processed by two LSTMs, forward and backward, respectively, and the extracted feature vectors are spliced as the final output features. The calculation process is as follows:
h_t^f = LSTM(x_t, h_{t−1}^f)
h_t^b = LSTM(x_t, h_{t+1}^b)
h_t = W_T h_t^f + W_V h_t^b
The forward LSTM and backward LSTM states at time step t are represented by h_t^f and h_t^b, respectively, and W_T and W_V denote the weight coefficients of the forward LSTM and the backward LSTM, respectively.
While a single CNN network is adept at capturing the spatial characteristics of wireless signals, it falls short in capturing their temporal features. Inspired by the structure proposed in [19], we integrate a BiLSTM network in series behind the CNN to extract bidirectional time series features along the temporal axis. This design comprises two BiLSTM layers with 128 units each, enabling efficient processing of sequence data and the extraction of time correlations. The gate mechanism within the BiLSTM effectively mitigates gradient vanishing and explosion, improving classification accuracy.
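The bidirectional pass can be illustrated with a simplified recurrent cell standing in for the full LSTM; the tanh cell and the weight shapes below are illustrative assumptions, not the paper's exact layers:

```python
import numpy as np

def run_rnn(xs, Wx, Wh):
    """Simplified recurrent pass (a tanh cell stands in for the LSTM)."""
    h = np.zeros(Wh.shape[0])
    out = []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
        out.append(h)
    return np.stack(out)

def bilstm(xs, Wx_f, Wh_f, Wx_b, Wh_b):
    """Forward pass over the sequence, backward pass over the reversed
    sequence, then per-step concatenation as in Figure 3."""
    fwd = run_rnn(xs, Wx_f, Wh_f)
    bwd = run_rnn(xs[::-1], Wx_b, Wh_b)[::-1]
    return np.concatenate([fwd, bwd], axis=1)
```

With a hidden size of H per direction, each time step yields a 2H-dimensional feature, which is why two 128-unit BiLSTM layers produce 256-dimensional per-step outputs.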
3.3. Time Attention Mechanism
In deep learning, the time attention mechanism is a method for handling sequential input. It allows the model to assign different importance, or attention, to the information at different time steps when processing sequence data. With the help of this approach, the model can more effectively understand the dependencies and significance of various sequence segments and shift its attention as necessary. During initialization, this layer creates two sets of weights, W and b, which are used for calculating attention scores. Attention scores are computed using Formula (14):
e = tanh(xW + b)    (14)
where x represents the input. Softmax indexes and normalizes the attention scores e to obtain the attention weights a. To emphasize the key components of the input sequence, these attention weights are applied to the input, and the weighted inputs are summed along the sequence axis to produce the final output of the layer.
By means of the aforementioned mechanism, the temporal attention mechanism is able to dynamically modify the weights in accordance with the significance of various segments of the input sequence. This allows the model to concentrate on the most pertinent information at every stage, thereby enhancing its processing efficiency and highlighting the salient features of the sequence data.
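A minimal sketch of this temporal attention, assuming a per-step score e_t = tanh(x_t · W + b) with learned W and b, a softmax over the time axis, and a weighted sum of the inputs (the exact shapes in the paper's Formula (14) are not reproduced verbatim):

```python
import numpy as np

def time_attention(x, W, b):
    """Temporal attention over a (T, D) sequence: score each time step,
    normalize with softmax, and return the attention-weighted sum."""
    e = np.tanh(x @ W + b)      # one scalar score per time step, shape (T,)
    a = np.exp(e - e.max())
    a /= a.sum()                # softmax over the time axis
    return (a[:, None] * x).sum(axis=0), a
```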
3.4. Spatial Attention Mechanism
The spatial attention mechanism allows the model to adaptively learn the attention weights of various regions by incorporating an attention module, whereas the temporal attention mechanism seeks to capture the significance of time series data. In this way, the model can focus more on important areas of the feature map and ignore unimportant areas. Using convolution operations and the sigmoid activation function to highlight certain regional traits, the spatial attention mechanism represents the significance of input features in the spatial dimension. The working mechanism is
M = σ(F)
where M represents the attention weights obtained through the activation function, σ denotes the sigmoid activation function, and F is the result of the convolution operation.
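A sketch of the mask computation, assuming a single same-padded convolution followed by the sigmoid, with the mask applied elementwise to the feature map; the kernel size is an illustrative assumption:

```python
import numpy as np

def spatial_attention(feat, kernel):
    """Spatial attention: compute conv(feat), squash it to (0, 1) with a
    sigmoid to obtain per-position weights, and reweight the feature map.
    Assumes an odd-sized kernel so zero padding preserves the shape."""
    kh, kw = kernel.shape
    x = np.pad(feat, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    conv = np.empty_like(feat)
    for i in range(feat.shape[0]):
        for j in range(feat.shape[1]):
            conv[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    mask = 1.0 / (1.0 + np.exp(-conv))  # sigmoid -> attention weights M
    return feat * mask
```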
To improve the quality of feature representation, we add two FC layers, each containing 128 neurons, with scaled exponential linear unit (SeLU) activations to deepen the network's capacity. To combat overfitting, the dropout algorithm is judiciously employed, ensuring robust performance. The output layer uses Softmax activation with 10 neurons (11 for the RadioML2016.10a dataset), each neuron corresponding to a distinct modulation scheme, and determines the modulation mode of the modulated signal.
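The classification head can be sketched as follows. The input dimensionality and the omission of dropout (a training-time mechanism, inactive at inference) are assumptions of this sketch:

```python
import numpy as np

def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    """SeLU activation with the standard self-normalizing constants."""
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def head(x, W1, b1, W2, b2, W_out, b_out):
    """Two 128-unit FC layers with SeLU, then a Softmax output layer
    (10 classes for RML2016.10b, 11 for RML2016.10a)."""
    h = selu(W1 @ x + b1)
    h = selu(W2 @ h + b2)
    return softmax(W_out @ h + b_out)
```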
5. Conclusions
To address the issue of low modulation recognition accuracy at low SNRs, the SP module is used to denoise and preprocess the input signal, eliminating the influence of noise on the I/Q data samples. A CNN-BiLSTM-DNN neural network model is proposed, composed of a CNN, BiLSTM, an attention mechanism, and a fully connected deep neural network, which uses their complementarity and synergy to extract and classify spatiotemporal features, overcoming the limitation of traditional networks that extract only spatial or only temporal features from sample signals. The incorporation of an attention mechanism further enhances the model's ability to discern relevant patterns within the data, effectively filtering out noise and irrelevant information. Thanks to this focused attention on important details, the model performs noticeably better in recognizing and categorizing target signals of interest.
Experimental results indicate that the proposed method can effectively improve modulation recognition accuracy for low-SNR wireless communication signals. Modulation recognition experiments on the benchmark datasets RML2016.10b and RML2016.10a show that the proposed model achieves higher average recognition accuracy from −20 dB to 18 dB on both datasets, with the largest gains in modulation recognition accuracy occurring between −10 dB and 4 dB, at the cost of increased computational complexity and training time. Future work will streamline the algorithm to shorten training time and improve accuracy on a few easily confused modulation types.