1. Introduction
Automatic modulation recognition (AMR) technology is used to identify the modulation schemes of received signals and find extensive applications in both military and civilian domains, closely related to non-cooperative communication, electronic warfare, and security detection [
1,
2,
3]. With the rapid development of wireless communication technologies, the modulation types of signals are becoming increasingly diverse and complex. Additionally, the radio frequency environment is becoming more challenging, leading to frequent occurrences of signal time–frequency overlap [
4]. Currently, research on automatic recognition of single modulation types has become relatively mature. For instance, methods employing decision theory, likelihood approaches, classification models based on domain knowledge and manually designed features [
5,
6], and neural network models based on deep learning [
7] have been developed. However, research on the recognition of signals involving two or more modulation types is less explored. There are two main approaches for modulation type recognition in the context of multiple signal mixtures: one involves preprocessing by separating signals first and then recognizing the modulation type of each isolated signal, and the other approach aims to directly recognize multiple signals without extensive preprocessing or separation.
In the first approach, the effectiveness of signal classification is closely tied to the performance of signal separation. Commonly used algorithms for multi-signal separation include Independent Component Analysis (ICA) and its variant, Fast Independent Component Analysis (FICA) [
8,
9,
10]. These methods frequently necessitate substantial prior knowledge to separate overlapping signals using channel matrices, which may be restrictive in practical applications.
Yang et al. employed a Uniform Linear Array (ULA) as a signal receiver to capture and distinguish signals. They utilized advanced techniques such as high-order cumulants and instantaneous statistical amplitude features, combined with a classical decision tree, for digital signal modulation classification. This method yielded positive results, even under conditions of low signal-to-noise ratio [
11]. However, the effectiveness of this approach heavily depends on the precise estimation of signal arrival angles, and the classification performance is closely tied to this accuracy. Moreover, the use of ULA for signal reception and separation is more appropriate for scenarios where the signal reception matrix is either positive definite or overdetermined, which limits its use in complex settings dominated by underdetermined reception matrices. Ai et al. implemented a UNet semantic segmentation network to categorize time–frequency plots of radar signals, designed to handle an unspecified number of radar signals. This technique delivered impressive Mean Intersection over Union (MioU) and Mean Pixel Accuracy (MPA) results for the modulation classification of both single and multiple radar signals in environments with high noise levels [
12]. Nonetheless, semantic segmentation networks, which depend on pixel-level identification, often encounter challenges such as high computational demands, lengthy training periods, and substantial parameter sizes. Wang et al. merged the ALOHA algorithm with a binary tree algorithm to create a machine learning-based hybrid anti-collision algorithm for analyzing mixed RF signals. This strategy successfully separates signals under both positive definite and underdetermined conditions, effectively estimating the number of RF signals and showcasing strong recognition capabilities and robustness [
13]. These techniques underscore diverse strategies for addressing the challenges in recognizing modulation types in mixed signal scenarios, each with distinct advantages and disadvantages. Further research is essential to develop robust and efficient methods for automatic modulation recognition in the complex and fluctuating conditions of radio frequency environments.
Current techniques for recognizing multiple signal modulations, which expand on the recognition of individual modulation types, have advanced considerably. Nevertheless, significant interference and overlap among signals present challenges in feature extraction for neural networks, necessitating more advanced methods to effectively retain information. Moreover, neural networks for signal classification often utilize multiple classifiers. For example, reference [
14] describes a CNN-based method for automatically recognizing mixed-signal modulations. However, a limitation of this method is the rapid increase in classification categories as the number of modulation types increases.
Inspired by fine-grained image classification networks and methods for fine-grained signal analysis [
15,
16,
17,
18], this paper introduces a recognition model designed to achieve a detailed feature analysis of signals. The objective is to develop a high-precision recognition model that can accurately classify mixed signals across various modulation schemes, providing valuable insights for subsequent digital communication system processes. The model utilizes a ResNet framework as its core architecture to extract essential information. It incorporates a Geometric Coordinate Attention Fusion module, which applies principles of information geometry to enhance pooling methods and bolster attention mechanisms. This strategy aims to thoroughly analyze feature information of samples, integrate features across different scales and modalities, and employ low-rank bilinear pooling to capture intricate signal details by amplifying differences between signal features. This enhancement boosts modulation type recognition and reduces the need for signal preprocessing, ultimately improving classification performance. To mitigate the impact of varying signal quantities on the network’s output parameters, a multi-label classification approach is adopted for signal categorization. This method significantly reduces output dimensions relative to traditional multi-class classification techniques. In the experiments, a dataset was constructed with signals from five different modulation types, including mixed signals, under various signal-to-noise ratios to validate the effectiveness of the proposed module in enhancing classification capabilities. Additionally, experiments with a robustness dataset demonstrated the model’s stability and general applicability across diverse mixed signal conditions. These research findings provide new ideas and methods for the development of automatic modulation recognition technology, with the potential to enhance the performance and reliability of wireless communication systems in practical applications. Future research will continue to focus on optimizing model structures, improving recognition accuracy, and exploring more efficient signal feature extraction and classification methods to address increasingly complex and dynamic communication environments.
The main innovations of this paper are as follows:
- (1)
The proposed Geometry Coordinate Attention Fusion Module is applied in neural networks. This module defines pooling operations based on information geometry theory to design an effective spatial dimension feature enhancement mechanism. This method provides corresponding time–frequency weights for features, resulting in more expressive feature representations.
- (2)
The low-rank bilinear pooling module is introduced to achieve cross-layer interaction between fused features and backbone features. This module obtains fine-grained feature representations of signals by mapping and aggregating between different features.
The remainder of this article is as follows. In
Section 2, we introduce the signal reception model, the derivation of the information geometry model, and related work, including the backbone neural networks we used. In
Section 3, we present our proposed model framework and its underlying principles. In
Section 4, we evaluate various performance metrics of the model through experiments and conduct a comparative analysis. In
Section 5, we summarize the strengths and weaknesses of the proposed model and provide an outlook on future development directions.
3. Geometry Coordinate Attention and Low-Rank Bilinear Pooling Network
The structure of the geometric coordinated attention and low-rank bilinear pooling network constructed in this article is shown in
Figure 1. This paper uses Resnet-50 as the backbone network, adds geometric coordination attention to further improve the feature learning ability of the convolutional network, and then introduces a low-rank bilinear pooling module to achieve fine-grained analysis of signal sample features. Finally, multi-label classification is used to replace the multi-class output sample prediction values. In this section, we first introduce the principle of the geometric coordination attention module, and then introduce the multi-modal feature fine-grained fusion method of the low-rank bilinear pooling module.
3.1. Geometric Coordinated Attention Module
In order to effectively retain multi-level feature information, this paper designs a weighted fusion module of multi-scale features. Extract low-, medium-, and high-level features from the backbone network, so that features of different scales are optimized and fused in the time–frequency domain. The design structure of this module is shown in
Figure 2.
First, the low-, medium-, and high-level features x1, x2, and x3 of the received signal are extracted from different layers of the backbone network. Feed the features individually into GCA, where H, W, and C represent the height dimension, width dimension, and channel dimension of the feature, respectively. Since the input feature of the neural network is the time–frequency feature plot of the signal, the width dimension and height dimension of the feature can be regarded as the time and frequency dimensions of the feature. Then, in the GCA, the time–frequency dimension weight information of multi-scale modal features is extracted and weighted, and finally, the multi-scale modal features are fused.
In GCA, the pooling operation
is defined according to Equation (18) as
where
;
and
are the mean and variance of the input features in a certain dimension. The pooling operation method defined by Equation (19) comprehensively considers the statistical and geometric characteristics of features, and can extract more feature information than the traditional average pooling operation and maximum pooling operation.
The input features encode the dimensions along the horizontal and vertical axes through two pooling operations to generate a pair of dimension-aware feature sets. This transformation can help locate the features in spatial locations, and correlate features in time and frequency. The mathematical expression of vertical pooling coding (H Pool) is as follows:
where
is the feature of the received signal feature in each channel dimension of the neural network,
is the minimum granularity of the height dimension,
is the minimum granularity of the channel dimension, and
is the height direction feature of the sample in the channel dimension and is also a frequency characteristic. The mathematical expression of horizontal pooling coding (W Pool) is as follows:
where
w is the minimum granularity of the width dimension and
yW ∈
R1×W×C is the width direction feature of the sample in the channel dimension and also the time feature.
Time and frequency features are concatenated in the channel dimension, and the convolution function with a convolution kernel size of (1,1) is used to obtain the time–frequency joint intermediate state features. Then, the batch normalization and a non-linear activation function are applied to reduce differences among the joint features.
where
r is used to control the GCA parameter amount.
Then, the joint intermediate features along the channel dimension are reshaped into two features:
and
. Over-sampling operations using convolutional functions with kernel size (1,1) are applied on each feature to restore the parameter dimensions equal to the input features. Finally, a non-linear activation function is applied to each feature to obtain weights or coefficients along the time and frequency dimensions.
where
(·) is the sigmoid function operation, which performs normalization and activation.
Finally, multiply the input features of different scales by their corresponding feature weight coefficients along the dimensions to obtain enhanced weighted features along the time and frequency dimensions. Additionally, normalize the scaled weighted features to achieve feature fusion with unified scale parameters.
where
is the weighted feature obtained by multiplying the corresponding input feature with its weight coefficient.
where the ⊕ operation is element-wise addition. The variable
is the final output, which is a multi-scale weighted fusion feature.
3.2. Low-Rank Bilinear Pooling Module
This paper leverages a method involving the outer product of extracted neural network features from two modalities to perform weighted fusion of features, obtaining bilinear characteristics of the signal to enhance the network’s ability to analyze fine-grained features and consequently improve its capability for signal modulation recognition. Addressing the issue of parameter explosion resulting from bilinear pooling operations, the paper employs low-rank bilinear pooling to optimize the operation and reduce parameter count. Using Hadamard product in place of the outer product to interact with different features, the corresponding relationships are captured. This approach significantly reduces parameter count at the expense of certain computational complexity. In the final fully connected layer, a staged reduction in parameters is employed alongside introducing dropout to mitigate the risk of overfitting.
The bilinear pooling operation is shown in
Figure 3, and its mathematical expression is
where
and
are two features that are mutually mapped.
is the backbone feature obtained from input samples through the backbone architecture, while
is the fused features output by the Geometric Coordinate Attention module from the high-, middle-, and low-level features of the samples.
and
are the channel dimensions of the two features,
is the mapping between the channel dimensions of the two features,
is the weight matrix,
is the bias, and
represents the output features.
Perform matrix decomposition on the weight matrix as follows:
where
and
are low-rank decomposition matrices of
and
is the joint embedding dimension. Equation (28) can be further expressed in Hadamard product form, with a mathematical expression as follows:
where
denotes the Hadamard product and
is the output feature after the mathematical expression transformation.
Using two low-rank matrices, and , to approximate avoids the direct computation of the outer product of the two features as in the original bilinear pooling method. This reduction in computation and parameters decreases the parameter count from to . It can be observed that the parameter count is controlled jointly by the joint embedding dimension m and the channel numbers of the two input features. Therefore, while ensuring the features remain unchanged, it is possible to attempt multi-channel merging to reconstruct feature parameter scales, reduce the number of channels, and increase computational complexity to decrease parameter count, ensuring efficient network operation.
In this module, the mutual mapping between two sets of input features is used to extract joint representations of features across channels, enabling multi-modal bilinear pooling for improved neural network performance in signal modulation recognition within the same sample. Additionally, the use of low-rank matrix approximation for outer product computations addresses the issue of excessive parameter count in bilinear pooling, thereby mitigating potential inefficiencies and overfitting effects in neural network operations.
3.3. Loss Function and Evaluation Metrics
This paper’s model uses a multi-label classification approach for signal classification, which consists of multiple binary classifiers combined together. Sample labels are encoded using one-hot encoding. The normalization function connected to the model’s output layer is replaced with the sigmoid function instead of the Softmax function. The model employs a combination of the cross-entropy loss function and the binary cross-entropy loss (BCEloss) for multi-label binary classification. Compared to traditional multi-class algorithms, this method effectively reduces label dimensions and optimally utilizes the feature space of output predictions. The BCEloss not only considers the situation when the prediction value is 1 but also incorporates the impact of the prediction value being 0 on the loss. The expression for the BCEloss is as follows:
where
is the true value of the
i-th label in the
j-th sample, and
is the predicted value of the
i-th label in the
j-th sample.
is the total number of samples, which is equivalent to the batch size during batch training.
is the number of sample labels, which is equivalent to the prediction output dimension.
The evaluation indicators of the experiment are based on ref. [
21], which uses the exact match ratio (EMR), hamming score (HS), and F-Measure (F1). The three indicators are used as a specific evaluation of the model’s ability to identify signal modulation types. The exact match ratio is the strictest measure of model recognition effectiveness, representing the ability of model identification to be completely accurate, but partial accuracy is also part of model evaluation. Therefore, hamming score and F-Measure are included as evaluation metrics. The hamming score refers to the accuracy feedback of the overall sample, representing the model’s perception of all labels. The F-Measure is a comprehensive measure of sample recognition accuracy, balancing precision and recall, representing the balance of the model.
The calculation formula for the exact match ratio (EMR) is as follows:
where
is the predicted value set for the
j-th sample,
is the true value set for the
j-th sample, and
is a function that returns 1 when the predicted value is exactly equal to the true value, otherwise it returns 0. Accuracy represents the average accuracy across all samples, where for each individual sample, it indicates the proportion of signal categories where both predicted and true values are true among the total signal categories predicted as true or true in reality.
The hamming score calculation formula is as follows:
The F-Measure calculation formula is
4. Simulation Experiment and Performance Analysis
4.1. Dataset
In this experiment, a single-antenna reception method was simulated under complex environments by adding colored noise to sample signals to better reflect real-world conditions. The dataset consists of single-signal samples or mixed-signal samples with a signal-to-noise ratio (SNR) spanning from −12 dB to 8 dB in 2 dB increments. The signal sampling frequency is 1 × 10
7 Hz, with a frequency range from 0.5 × 10
7 Hz to 3 × 10
7 Hz. The resulting sample data format is a 224 × 224 RGB matrix. The mixing of signals is based on fifteen combinations as listed in
Table 1, with each type of sample generated at various SNR levels resulting in 240 samples per combination, totaling 39,600 signal time–frequency energy plots. The mixing levels among different signal types are evenly distributed. The dataset is divided into training and test sets with a ratio of 5:1.
4.2. Experiment Environment
The experimental environment configuration for this paper is as follows: the computer system used was Ubuntu 16.04, with a TAITAN graphics card and 24 GB of memory. The programming language used was Python 3.8, the deep learning framework was PyTorch 1.7, and the optimizer was Adaptive Moment Estimation optimizer. For training, the approach involved training from scratch, with an initial learning rate set to 0.001 and adaptive decay. The training was conducted for 200 epochs, and the batch size was set to 32 based on network parameters and GPU memory size.
4.3. Ablation Study
To verify the reliability of the proposed modules in this paper, four ablation experiments were conducted using the same dataset and experimental setup with identical evaluation metrics. The neural network model containing both the Geometry Coordinate Attention Fusion Module and the low-rank bilinear pooling module was named GCABNet; the model containing only the Geometry Coordinate Attention Fusion Module was named GCANet; the model containing only the low-rank bilinear pooling module was named BNet; and the backbone network model was named InitNet. In the BNet model, both input variables in the bilinear pooling module use feature x. Additionally, a model named GCABVGG, which integrates the proposed modules into the VGG16 network structure, was compared with the VGG16 network (VGG).
The results of the ablation experiments under dynamic SNR conditions depicted in
Figure 4a–c indicate that the GCA Module and the low-rank bilinear pooling module proposed in this paper led to an approximately 9% increase in EMR and a 5% increase in HS under low SNR conditions. The model’s recognition capability improved under high SNR conditions, with an EMR of over 98% for signal modulation type recognition at SNRs above 2 dB and approaching 100% at SNRs exceeding 4 dB. Specifically, the low-rank bilinear pooling module achieved an approximate 6% increase in EMR under low SNR conditions, demonstrating that bilinear pooling can yield more effective discriminative features at a finer granularity, significantly impacting the model’s recognition capability. Furthermore, the Geometric Coordinate Attention accurately obtained time–frequency dimension weights from multiple scales of sample features, effectively retaining sample feature information and assisting the model in capturing underlying discriminative features. Therefore, the proposed modules in this paper show feasibility in improving the recognition of signal modulation types.
Moreover, as shown in the experimental results in
Figure 4d-f, integrating the proposed modules into the VGG network as the backbone structure resulted in a 7% improvement in EMR and approximately 9% improvement in HS under low SNR conditions, with improved balance in the F-Measure, further demonstrating the reliability of the proposed modules in this paper.
4.4. Attention Comparison Experiment
To verify the advantages of obtaining more effective time–frequency weight coefficients using the Geometric Coordinate Attention proposed in this paper’s GCA Fusion Module, this section conducts six comparative experiments by replacing the GCA part of the GCA Fusion Module with CA, SAM, BAM, CAM, CBAM, and SENet, while keeping all other conditions identical in the experiment environment. The contrasting performance of different attention on the dataset in terms of recognition capability and classification effectiveness is evaluated.
Specifically, CA (Coordinate Attention) analyzes the input features in multiple dimensions through multiple independent attention heads and outputs weight coefficients after normalization. SAM (Spatial Attention Module) analyzes both height and width dimensions, enhancing the model’s ability to process spatial information. CAM (Channel Attention Module) extracts channel dimension information of features, enhancing the model’s ability to learn dependencies between different channels. BAM (Bottleneck Attention Module) is a parallel combination of CAM and SAM attention mechanisms. CBAM (Convolutional Block Attention Module) is a series combination of CAM and SAM attention mechanisms. SENet (Squeeze-and-Excitation Network) enhances the response of important feature channels while suppressing the response of unimportant feature channels.
The experimental results in
Table 2 demonstrate that the GCA Module used in this paper exhibits superior performance compared to classical attention. The GCA used in this paper outperforms other attention mechanisms in terms of both EMR and HS, with up to a 3% performance advantage in EMR and a 1% performance advantage in HS. Furthermore, the F-Measure results indicate that the stability of the mechanism proposed in this paper is also advantageous compared to other mechanisms. These results demonstrate that the GCA proposed in this paper maximizes the extraction of signal feature weight coefficients based on information geometry theory, enhancing the learning ability of neural networks and thereby improving model recognition capabilities. Compared to other mechanisms, the GCA considers both local granularity information in the time–frequency domain and integrates channel weights, retaining more feature information and enhancing the separability of signal features, proving the feasibility and importance of capturing time–frequency weight information. The comparison between the proposed GCA and the CA concludes that attention mechanisms based on information geometry theory can more effectively retain sample feature information, thus improving model recognition capabilities.
4.5. Robustness Experiment
To verify the robustness of the signal recognition model proposed in this paper, a supplementary dataset was created in this section. It includes additional mixed-signal combinations not present in the original dataset from
Section 4.1. Under the same experimental conditions, this supplementary dataset was used as the test set for further experiments to validate the reliability of the signal recognition model. The signal combinations in the supplementary dataset are shown in
Table 3. The robustness dataset comprises mixed signals with signal-to-noise ratios (SNRs) ranging from −12 dB to 8 dB in steps of 2 dB. Each type of sample was generated 40 times at each SNR level. In the robustness experiment, a confusion matrix was used to represent the experimental results. In this matrix, aside from signal mixture types serving as labels, the “X” label was introduced to indicate that the model recognized a signal combination not present in the corresponding dataset mixture. Additionally, the “NULL” label denotes cases where the model did not recognize any signals.
Figure 5 shows the confusion matrix obtained by randomly selecting 80 signal samples for each mixed type from the original dataset at various signal-to-noise ratios (SNRs). It reveals that the EMR for signals reaches 84.6%, with EMR exceeding 92% for single signals and nearing 80% for multiple signals. The HS is concluded to be 91.1%. It is also observed that as the number of coexisting mixed signals within a sample type increases, the model’s ability to recognize signal types weakens. This is due to the increased number of signals under the same noise conditions, causing energy dispersion in the time–frequency domain and resulting in blurred features that reduce separability and increase recognition difficulty.
In
Figure 6, the EMR for signals from the supplementary dataset is only 47.2%, while the HS reaches 84.6%. The supplementary dataset contains a significantly higher number of mixed-signal samples compared to the original dataset, resulting in a noticeable decrease in the EMR due to the increased complexity of mixed samples. However, the HS approaches that of the original dataset. Considering the EMR, HS, and analysis of the confusion matrix, the model’s ability to recognize signal types in the supplementary dataset is comparable to that in the original dataset. However, there is a discernible limitation in distinguishing between different types of signal combinations, particularly affecting the recognition performance when QAM and BPSK signal types coexist. Therefore, under the same experimental conditions, introducing supplementary dataset samples with a 40:1 ratio in the training set enhances the model’s learning of different types of combinations and leads to a new supplementary confusion matrix.
In
Figure 7, the EMR of the samples reaches 73%, with an HS of 90.3%. The majority of samples that did not achieve an absolute match were due to misidentifications of a single signal type, indicating a certain level of accuracy. Therefore, the model in this study demonstrates a certain capability and robustness in recognizing signal combinations outside of the training samples. After training with small samples of unknown combinations, the recognition capability improves significantly, with an approximately 17% increase in EMR and a 5.7% increase in HS, approaching the model’s accuracy on the original dataset.