1. Introduction
In the field of radio communication, modulation identification techniques for communication signals have been extensively studied and refined, and they are of vital importance in areas such as civilian and military applications [1]. Currently, there are two main types of modulation identification techniques: modulation identification based on decision theory and modulation identification based on feature engineering [2]. In the former, the test statistics are complicated to compute and require some a priori information, although the decision rules are simple; in the latter, feature extraction is simple and computationally light, but the decision rules are complicated. Both methods, however, currently suffer from heavy manual involvement, shallow extraction of signal features, and easy confusion between different signals [3]. Moreover, with the increase in signal-type complexity and the development of computer technology, deep learning has come to play a prominent role in signal modulation recognition [4,5,6,7].
Deep learning offers superior recognition accuracy compared to statistical learning and hypothesis-testing methods, and a better ability to learn directly from high-dimensional raw input data compared to classical machine learning. These advantages make deep learning a popular research direction for signal recognition techniques. In the literature [8], the authors were the first to apply convolutional neural networks (CNNs) to radio signal classification and provided the simulation dataset RML2016.10a generated in the GNU Radio environment, on which our subsequent investigations are also based. The literature [9], in turn, investigates deeper network structures for modulation recognition and obtains an improvement in effectiveness. The advantages of over-the-air deep-learning-based radio signal classification over conventional algorithms are discussed in [10]. Further, the literature [7] trained different neural networks, such as a baseline network and a residual network, and obtained relatively high recognition rates.
The neural network frameworks proposed since then pursue reliability along with lower error rates and the ability to process a wider range of data [11]. Vanishing and exploding gradients limit how deep ordinary neural networks can be made, and although batch normalization (BatchNorm) was developed to deal with this problem [12], the phenomenon of deep network degradation persists. It was not until the residual structure, i.e., ResNet [13], was proposed that the degradation problem was solved. However, the traditional approach of reducing the error rate by increasing network width and depth leads to an increase in the number of hyperparameters, which increases the design difficulty and computational effort of neural networks. To solve this problem, ResNeXt [14], an optimized version based on ResNet, was proposed. ResNeXt adopts a residual structure and uses group convolution [15] instead of the three-layer convolution structure of ResNet, which improves the accuracy of the network while reducing the number of hyperparameters and decreasing the parameter complexity. However, it is still not effective when facing more complex signals with low signal-to-noise ratios.
The group convolution strategy used in the ResNeXt network improves recognition accuracy but weakens generalization ability, and the confusion between signals becomes more serious at low signal-to-noise ratios. Moreover, other existing radio modulation recognition methods also often have poor anti-interference performance and easily confuse different signals. In this regard, this paper establishes the RFSE-ResNeXt (residual-fusion squeeze–excitation aggregated residual network) model. By improving the residual structure of the network, the extracted deep and shallow features are fused, and a squeeze–excitation structure is introduced to increase the weights of key features, so that the recognition accuracy of the network is improved when facing complex signals with low signal-to-noise ratios.
2. ResNeXt Model Structure
Xie et al. [14] proposed the ResNeXt network structure at CVPR 2017 as an optimization of ResNet. To improve the accuracy of neural network models, existing methods usually adopt the strategy of deepening or widening the network. However, increasing the depth makes training more difficult and hinders the convergence of the network parameters, while increasing the width means higher complexity and more computation, and the surge in the number of parameters often results in overfitting. To address this problem, the residual module was proposed in the ResNet network, which effectively avoids the gradient vanishing and gradient explosion problems brought about by deepening CNN networks.
ResNeXt introduces a new dimension, the cardinality (i.e., the number of parallel branches), so that it performs better at the same model complexity. The original ResNet has a cardinality of 2, while ResNeXt increases the cardinality to 32 by using a parallel topology. This makes ResNeXt more accurate than ResNet with a comparable number of parameters and brings a significant improvement in computational efficiency.
Looking at both networks internally, the increase in the ResNeXt dimension implies a generalization of the fully connected layer: for ResNet, the output is a single summation of weights × outputs, whereas ResNeXt contains several parallel topologies internally and can therefore be represented by the following Equation (1):

$F(a) = \sum_{i=1}^{n} \partial_i C_i(a)$  (1)

where $\partial_i$ is the weight of the i-th topology, $C_i(a)$ is the output value of one flat identical topology, and n is the number of identical branches in a module, i.e., the cardinality (base) of the model. The block diagrams show this structure more intuitively.
Figure 1 shows the block structure of ResNet, and Figure 2 shows the block structure of ResNeXt.
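To make Equation (1) concrete, the following sketch builds a ResNeXt-style block with the Keras functional API (the framework used later in this paper, shown here with the tensorflow.keras import path): the block output is the identity shortcut plus the sum of n identical parallel branches. Only the cardinality of 32 comes from the text; the bottleneck width of 4 and the function name resnext_block are illustrative assumptions.

```python
# Minimal sketch of an aggregated residual (ResNeXt-style) block, Equation (1):
# y = x + sum_i C_i(x), with n identical parallel branches of the same topology.
from tensorflow.keras import layers


def resnext_block(x, in_channels=256, bottleneck=4, cardinality=32):
    """Aggregated residual transform; assumes x already has in_channels channels."""
    branches = []
    for _ in range(cardinality):
        b = layers.Conv2D(bottleneck, 1, padding="same", activation="relu")(x)
        b = layers.Conv2D(bottleneck, 3, padding="same", activation="relu")(b)
        b = layers.Conv2D(in_channels, 1, padding="same")(b)
        branches.append(b)
    aggregated = layers.Add()(branches)    # sum over the n identical branches
    y = layers.Add()([x, aggregated])      # identity shortcut
    return layers.Activation("relu")(y)
```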
3. RFSE-ResNeXt Model Structure
3.1. Squeeze Excitation Structure
Hu et al. proposed the squeeze–excitation (SE) structure along the channel dimension in 2017 [16]. The core idea of the SE module is to learn a weight for each feature channel during training, assigning larger weights to the feature maps that help reduce the loss function and smaller weights to the feature maps that contribute little or nothing to it. In this way, useful features are boosted and useless features are suppressed. The specific structure is shown in
Figure 3.
The SE structure is divided into a squeeze step and an excitation step.
In the squeeze step, the feature map U obtained from the convolution (where H, W, and C denote its height, width, and number of channels, respectively) is passed through global average pooling (GAP) to obtain an output of dimension 1 × 1 × C.
In the excitation step, the globally averaged features obtained in the squeeze step are fed into a two-layer fully connected (FC) structure with a rectified linear unit (ReLU) [17] and a Sigmoid function [18]: the first FC layer reduces the dimension to 1 × 1 × C′ (C′ < C), and the second restores it from 1 × 1 × C′ to 1 × 1 × C, producing the weight assigned to the features represented by each channel.
Finally, these channel weights are multiplied with the original feature map U to obtain a feature map that incorporates the channel attention information.
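The squeeze and excitation steps described above can be sketched as follows; this is a minimal Keras illustration, not the authors' code. The reduction ratio r = 16 (so that C′ = C/16) is an assumption, since the text only requires C′ < C.

```python
# Minimal sketch of the squeeze-excitation (SE) step: global average pooling,
# two fully connected layers (ReLU then Sigmoid), and channel-wise reweighting.
from tensorflow.keras import layers


def se_block(u, reduction=16):
    c = u.shape[-1]                                          # number of channels C
    s = layers.GlobalAveragePooling2D()(u)                   # squeeze: 1 x 1 x C
    e = layers.Dense(c // reduction, activation="relu")(s)   # excitation FC-1: C' < C
    e = layers.Dense(c, activation="sigmoid")(e)             # excitation FC-2: back to C
    e = layers.Reshape((1, 1, c))(e)
    return layers.Multiply()([u, e])                         # reweight the original map U
```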
3.2. Residual Fusion Structure
The convolutional network used in this paper is ResNeXt101 [
19], and the main body consists of four residual modules. The structure of the residual modules is shown in
Figure 4. In the input and output convolutional layers, small 1 × 1 × 1 convolutional kernels are used, and the middle grouped convolutional structure uses the usual cardinality of 32 with a kernel size of 3 × 3 × 3. The advantage of ResNeXt101 is that the group convolution is performed by stacking blocks of the same topology in parallel, which improves accuracy without increasing the parameter complexity.
The improved residual structure is shown in Figure 5, where layer1, layer2, layer3, and layer4 denote the four residual modules in the network. In the original network, the deepening of the layers causes some detailed features to be filtered out, resulting in insufficient utilization of the extracted features. The improved residual module structure of the ResNeXt network proposed in this paper makes full use of the feature information extracted from each layer by fusing the detailed features extracted by the shallow network with the features extracted by the deep network.
First, the original input data are convolved with a 1 × 1 kernel, and the resulting features are fused with the output features of layer2 to form the output of module1. Then, the features output by module1 are convolved twice with 1 × 1 kernels and fused with the output features of layer4. In this way, the signal features extracted by the residual modules are fused with one another.
The convolution operation is performed first to reduce the feature dimensionality so that the features can be fused. In this paper, layer-skipping fusion is used instead of layer-by-layer feature fusion. Firstly, because ResNeXt101 itself has a considerable number of parameters, layer-by-layer fusion would make the overall computation too large and the training time too long. Secondly, layer-by-layer fusion would incorporate redundant information and make overfitting unavoidable, which would instead reduce the recognition accuracy of the network. The two feature maps are fused element-wise [20].
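The skip-layer fusion described above might look as follows in Keras. Element-wise addition is assumed as the fusion operation, and the sketch assumes the three tensors share the same spatial size (in practice a strided convolution or pooling would be needed to align them); the helper name residual_fusion is illustrative.

```python
# Sketch of the skip-layer residual fusion: the raw input is projected by a
# 1 x 1 convolution and fused element-wise with layer2's output (module1); that
# result is projected by two further 1 x 1 convolutions and fused with layer4's output.
from tensorflow.keras import layers


def residual_fusion(inputs, layer2_out, layer4_out):
    # module1: project the raw input to layer2's channel width, then fuse element-wise
    p1 = layers.Conv2D(layer2_out.shape[-1], 1, padding="same")(inputs)
    module1 = layers.Add()([p1, layer2_out])

    # module2: two 1 x 1 convolutions on module1, then fuse with layer4's output
    p2 = layers.Conv2D(layer4_out.shape[-1], 1, padding="same")(module1)
    p2 = layers.Conv2D(layer4_out.shape[-1], 1, padding="same")(p2)
    return layers.Add()([p2, layer4_out])
```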
3.3. Establishing the RFSE-ResNeXt Network
In this paper, the residual module in ResNeXt101 [
19] is first modified as described above, and then the squeeze–excitation structure is introduced to obtain the RFSE-ResNeXt network.
Figure 6 shows the RFSE-ResNeXt block, the core structure of the network. The original signal input is first passed through the modified residual fusion structure to fully extract the shallow and deep features. The output is then multiplied with the feature information processed by the SE structure to enhance the useful features and suppress the interfering ones. Each convolutional operation here consists of convolution, batch normalization (BN) [21], and ReLU.
The final RFSE-ResNeXt network consists of one convolutional layer, four RFSE-ResNeXt block structures, two pooling layers, one fully connected layer, and one Softmax classifier; the network structure is shown in Figure 7. The network input layer takes IQ two-channel time-domain signals, with each sample of size 2 × 128. The raw data are first fed into the first convolutional layer (consisting of 50 convolutional kernels of size 2 × 8), and after the four RFSE-ResNeXt block structures and global average pooling, the final classification is performed by the Softmax classifier.
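A possible end-to-end assembly of the network in Figure 7 is sketched below, reusing the resnext_block and se_block sketches from earlier sections. The 2 × 128 IQ input, the 50 kernels of size 2 × 8, the four blocks, the two pooling layers, the fully connected layer, and the Softmax output follow the description above; the block width of 64, the 128-unit fully connected layer, the pooling positions, and the rfse_block wrapper are assumptions.

```python
from tensorflow.keras import layers, models


def rfse_block(x, width=64):
    # assumed wrapper: project to the block width, apply the grouped aggregated
    # transform (Section 2 sketch), then SE channel reweighting (Section 3.1 sketch)
    x = layers.Conv2D(width, 1, padding="same", activation="relu")(x)
    x = resnext_block(x, in_channels=width)
    return se_block(x)


def build_rfse_resnext(num_classes=11, width=64):
    inputs = layers.Input(shape=(2, 128, 1))                  # one IQ sample, 2 x 128
    x = layers.Conv2D(50, (2, 8), padding="same")(inputs)     # first convolutional layer
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.MaxPooling2D((1, 2))(x)                        # first pooling layer

    layer1 = rfse_block(x, width)
    layer2 = rfse_block(layer1, width)
    shortcut1 = layers.Conv2D(width, 1, padding="same")(x)    # project shallow features
    module1 = layers.Add()([shortcut1, layer2])               # fuse with layer2 output

    layer3 = rfse_block(module1, width)
    layer4 = rfse_block(layer3, width)
    shortcut2 = layers.Conv2D(width, 1, padding="same")(module1)
    shortcut2 = layers.Conv2D(width, 1, padding="same")(shortcut2)
    fused = layers.Add()([shortcut2, layer4])                 # fuse with layer4 output

    x = layers.GlobalAveragePooling2D()(fused)                # second pooling layer
    x = layers.Dense(128, activation="relu")(x)               # fully connected layer
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```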
4. Analysis of Model Results
4.1. Datasets
In this paper, we use the international common dataset RML2016.10a [
22] for the study. This dataset contains 8 classes of digital modulated signals (BPSK, 8PSK, CPFSK, GFSK, PAM4, QAM16, QAM64, QPSK) and 3 classes of analog modulated signals (AM-DSB, AM-SSB, WBFM), a total of 11 different modulated signals.
The signal-to-noise ratios of the 11 signal types range from −20 to 18 dB in steps of 2 dB. Each signal consists of two IQ channels with 128 sampling points at each signal-to-noise ratio. In addition, to better match the real environment, the dataset adds impairments such as center frequency offset, fading, and additive white Gaussian noise to the various signals.
For the comparative analysis of network performance, a total of 221,000 samples from the RML2016.10a dataset were used, of which 70% served as the training set and 30% as the test set. All network models in this paper use the Keras framework with the Theano backend. The hardware configuration is an Intel i7-8750 CPU and an Nvidia GeForce RTX 2080 GPU.
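For reference, loading and splitting the dataset might look as follows, assuming the public pickle release RML2016.10a_dict.pkl, which stores a dictionary keyed by (modulation, SNR) with arrays of shape (N, 2, 128); the 70/30 split mirrors the setup above and the random seed is arbitrary.

```python
import pickle
import numpy as np
from sklearn.model_selection import train_test_split

with open("RML2016.10a_dict.pkl", "rb") as f:
    data = pickle.load(f, encoding="latin1")       # dict keyed by (modulation, snr)

mods = sorted({mod for mod, _ in data})            # the 11 modulation classes
X, y, snrs = [], [], []
for (mod, snr), samples in data.items():           # samples: (N, 2, 128) IQ arrays
    X.append(samples)
    y.extend([mods.index(mod)] * len(samples))
    snrs.extend([snr] * len(samples))
X = np.vstack(X)[..., np.newaxis]                  # (num_samples, 2, 128, 1)
y, snrs = np.array(y), np.array(snrs)

# 70/30 train/test split, keeping the SNR label of each sample for later analysis
X_train, X_test, y_train, y_test, snr_train, snr_test = train_test_split(
    X, y, snrs, test_size=0.3, random_state=2016)
```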
4.2. Impact of Residual Fusion Structure on Network Performance
As shown in
Figure 8, the improved ResNeXt network structure achieves fairly good recognition accuracy overall. Although there is a slight decrease in accuracy at low signal-to-noise ratios, there is an improvement in both the middle and high ranges. The reason is that the group convolution strategy of the ResNeXt network improves recognition accuracy but reduces generalization performance (i.e., the effect deteriorates when facing complex signals with low signal-to-noise ratios). In this paper, the residual structure of ResNeXt is therefore improved so that the extracted deep and shallow features are fused with each other. Although this improvement raises the overall performance of the network, it does little to compensate for the weakness in generalization performance. Therefore, we introduce a squeeze–excitation structure on top of the residual fusion structure to improve the generalization ability of the network.
4.3. Impact of Squeeze–Excitation Structure on Network Performance
In order to investigate the effect of the incorporated squeeze–excitation structure on the classification ability of the network, we compare the models with and without it, using the global accuracy and the average accuracy from −20 dB to 0 dB of the network model on the test set as the evaluation criteria. The specific data are shown in
Table 1.
As can be seen from the table, the global accuracy of the model is only slightly improved after adding the squeeze–excitation structure, but the average accuracy at low signal-to-noise ratios is significantly improved. The reason is that the squeeze–excitation module provides an effective solution to the information overload problem present in the ResNeXt model. By learning a weight for each feature channel, the SE module assigns larger weights to the feature maps that help reduce the loss function and smaller weights to those that contribute little or nothing to it. Thus, the accuracy and efficiency of the network are improved by boosting useful features and suppressing useless features.
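The two evaluation criteria used in Table 1 can be computed from per-sample predictions as sketched below; model, X_test, y_test, and snr_test are assumed to come from the earlier sketches.

```python
# Global test-set accuracy and average per-SNR accuracy over the -20 dB to 0 dB range.
import numpy as np

y_pred = np.argmax(model.predict(X_test), axis=1)       # predicted class indices

global_acc = np.mean(y_pred == y_test)                  # accuracy over the whole test set

low_snrs = [s for s in np.unique(snr_test) if -20 <= s <= 0]
low_snr_avg = np.mean([np.mean(y_pred[snr_test == s] == y_test[snr_test == s])
                       for s in low_snrs])              # mean of per-SNR accuracies

print(f"global accuracy: {global_acc:.4f}, average accuracy (-20 to 0 dB): {low_snr_avg:.4f}")
```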
4.4. Performance Analysis of Different Networks
The experimental analysis compares the RFSE-ResNeXt network proposed in this paper with the optimized CLDNN network from the literature [7], the improved ResNet network from the literature [23], a CLDNN network, a VGG-CNN network, and an Inception network. The effectiveness of the various network models on the classification task is shown in
Figure 9.
As shown in
Figure 9, the model proposed in this paper shows a considerable improvement in recognition rate in the −20 dB to −8 dB interval even when compared with the current newer models, and it also has fairly good recognition accuracy at high signal-to-noise ratios. RFSE-ResNeXt achieves this result largely because it is the only model among these that combines the SE module with the grouped convolutional structure. The former allows the network to devote more capacity to the typical features of the signal and achieve better results with the same number of parameters; the latter allows the network to extract sample features more effectively, further improving the accuracy of the network. Meanwhile, the improved residual structure of RFSE-ResNeXt fuses the deep and shallow features, further enhancing its ability to extract signal features. Because of this, it performs quite well on the close-to-reality RML2016.10a dataset used in this paper.
Figure 10 and
Figure 11 show the confusion matrices of ResNeXt and the network model proposed in this paper at a signal-to-noise ratio of 0 dB, which visually reflect the recognition performance of the models for the 11 signal modulation types. Although the group convolution strategy adopted by ResNeXt greatly improves its ability to extract feature information, it does not perform well on signals with low signal-to-noise ratios. It can be seen that at 0 dB some WBFM samples are misidentified as AM-DSB, and some 8PSK and QPSK samples are confused with each other, which leads to a low recognition rate for the 8PSK, WBFM, and QPSK signals. The overall recognition accuracy of the RFSE-ResNeXt network model proposed in this paper is improved; in particular, the confusion between WBFM and AM-DSB is attenuated.
At the very low signal-to-noise ratio of −10 dB, ResNeXt can identify almost only one modulated signal, QAM64, and all other kinds of signals are heavily confused with CPFSK, whereas the model proposed in this paper considerably attenuates the confusion of AM-DSB, BPSK, etc., and improves the overall recognition accuracy. The reason is that the residual fusion structure fully extracts the deep features of the signals at low signal-to-noise ratios, while the squeeze–excitation structure enables the network to learn more of the key information needed for correct classification.
Table 2 shows the confusion between the network in this paper and the ResNeXt network for CPFSK signals at a signal-to-noise ratio of −10 dB.
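The per-SNR confusion matrices discussed above (Figures 10 and 11, Table 2) can be produced by restricting the test set to one SNR value, as in the following sketch; again, model, X_test, y_test, and snr_test follow the earlier sketches.

```python
# Confusion matrix of the 11 modulation classes at a single signal-to-noise ratio.
import numpy as np
from sklearn.metrics import confusion_matrix


def confusion_at_snr(model, X_test, y_test, snr_test, snr_value, num_classes=11):
    """Confusion matrix restricted to test samples at one SNR value."""
    mask = snr_test == snr_value
    preds = np.argmax(model.predict(X_test[mask]), axis=1)
    return confusion_matrix(y_test[mask], preds, labels=list(range(num_classes)))


cm_0db = confusion_at_snr(model, X_test, y_test, snr_test, 0)      # cf. Figures 10 and 11
cm_m10db = confusion_at_snr(model, X_test, y_test, snr_test, -10)  # cf. Table 2
```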