1. Introduction
Signal modulation recognition is a key technology of great significance in both civilian and military fields. In the civilian domain, it is widely used in the monitoring and management of the electromagnetic spectrum to ensure communication security and the effective control of signals. In the military context, it is used to assess the threat level of enemy equipment and to assist military reconnaissance by identifying the types and parameters of electromagnetic signals [1,2]. However, with the rapid development of information technology, the electromagnetic environment has become increasingly complex [3], which poses major challenges to radar signal modulation recognition, particularly in dense signal environments where multi-component signals may arise. Therefore, the effective analysis and identification of multi-component modulated radar signals under different signal-to-noise ratio (SNR) conditions is a practical and challenging task.
Traditional radar signal recognition methods fall into two categories: those based on decision theory and those based on feature extraction. Methods based on decision theory [4,5] transform the signal identification problem into a hypothesis testing problem, relying on probability theory and Bayesian estimation. However, these methods impose strict requirements on prior signal information and are computationally intensive, which makes them ill-suited to blind identification. Feature-extraction-based algorithms have lower computational complexity: the signal is preprocessed and converted into a certain transform domain, and classifiers such as artificial neural networks [6] and support vector machines [7] are trained on the resulting features. However, the performance of feature-extraction-based algorithms depends on the quality of the extracted features, and feature selection often relies on the experience of the researcher and is applicable only under relatively ideal conditions [8].
Signal modulation recognition based on deep learning (DL) can adaptively extract optimal features and greatly improve recognition performance [9,10,11,12]. Many researchers have studied long short-term memory (LSTM), one-dimensional convolution, and deep neural network (DNN) algorithms for feature extraction and classification in single-component time-domain radar signal modulation recognition, achieving good performance under low SNR conditions [13,14]. With the rapid development of convolutional neural networks (CNNs) in image processing, converting one-dimensional time-domain radar signals into two-dimensional time–frequency images (TFIs) and then using CNNs to extract features from and classify these images has also received increasing attention. Jiang et al. [15] used multi-layer decomposition to denoise the signal, obtained the signal's time–frequency image through the Choi–Williams distribution (CWD), and finally used an improved convolutional neural network to classify 12 types of signals. In another study [16], the authors assembled three CNN-based complex modules into LPI-Net, which was used to learn the texture features of CWD time–frequency images. In addition, Jiang et al. [17] used the smoothed pseudo WVD (SPWVD) to convert radar signals into TFIs and then introduced a locally densely connected U-Net as a denoising network to denoise the time–frequency images and assist a deep convolutional neural network (DCNN) in recognition. In order to obtain high-quality TFIs, image processing techniques such as filtering and cropping were used in the literature [18] to remove background noise and redundant frequency bands and obtain grayscale images containing the main morphological features. The preprocessed images were then input into the ACDCA-ResNeXt network to achieve recognition.
However, the works above focus mainly on single-component intra-pulse modulation classification under relatively ideal conditions. With the increasing complexity of intra-pulse modulation technology, during electronic countermeasures the receiver may receive multiple signals simultaneously, causing the received signals to overlap in the time and frequency domains. Therefore, it is necessary to consider the occurrence of multi-component intra-pulse modulation in electromagnetic space. Some researchers have proposed methods such as blind source separation [19] and parameterized time–frequency analysis (TFA) [20], which have achieved good performance. In recent years, with the rapid development of deep learning technology, researchers have also proposed multi-component radar signal modulation recognition methods based on deep learning. Among them, recognizing multi-component radar signals from their time–frequency images with CNNs has become a hot topic. Pan et al. [21] designed a multi-instance multi-label learning framework based on a deep CNN, combining SPWVD time–frequency diagrams with MIML-DCNN to recognize simulated overlapping signals of four different modulation types. Si et al. [22] proposed a new multi-class learning framework based on SPWVD and DCNN, which enhanced the connection between network modules by introducing the MBConv module and reduced the transmission loss of features. In order to better highlight the time–frequency characteristics of radar signals, ref. [23] used an improved Cohen-type time–frequency distribution (CTFD) to generate time–frequency images; three semantic segmentation networks, the fully convolutional network (FCN-8s), U-Net, and DeepLab V3, were then used to separate and identify the signals. Under low SNR, time–frequency images are susceptible to noise contamination. In ref. [24], the ResSwinT network is employed to denoise and reconstruct dual-component time–frequency images at various SNRs, and the SwinT network is used to recognize dual-component radar signals with random combinations among 12 modulation formats.
In summary, most existing multi-component modulation recognition works perform recognition on TFIs, using CNNs as the backbone feature extraction network and commonly adopting the TFI-CNN model. TFA has therefore become an indispensable part of multi-component radar signal research. TFA has been developed over several decades, producing many classic algorithms. Ebrahim Ghaderpour [25] demonstrated the potential of the least squares wavelet (LSWAVE) software (https://www.mathworks.com/matlabcentral/fileexchange/70526-lswave-signalprocessing, accessed on 9 March 2025) [26] for analyzing VLBI time series and for coherence analysis based on least squares wavelet analysis (LSWA). Mateusz et al. [27] studied the performance of empirical mode decomposition (EMD) and singular spectrum analysis (SSA) in detecting aerodynamic instability in centrifugal compressors. Liu et al. [28] used spectrum reconstruction technology to study a denoising method for random noise in active-source seismic data. TFA methods such as the short-time Fourier transform (STFT), the Wigner–Ville distribution (WVD), and the SPWVD are also widely used in various fields. Some existing multi-component radar modulation recognition works directly use these TFA algorithms to obtain TFIs and then focus mainly on the preprocessing of TFIs and subsequent tasks; however, the process of converting time-domain multi-component signals into TFIs is often overlooked. For instance, refs. [21,22] applied the SPWVD transform in their recognition systems, while ref. [24] used a denoising network to denoise and reconstruct TFIs. Although these methods can improve the recognition performance of multi-component radar signals, they also have the following issues:
- 1.
Existing TFI-based multi-component radar signal recognition works all use the TFI-CNN recognition framework. During the testing phase, the entire recognition process includes multiple independent steps such as time–frequency transformation (TFT), denoising, feature extraction, and classification, which increases the risk of introducing errors, and no end-to-end system has been formed to implement this process.
- 2.
It is difficult to obtain clear TFIs under low SNR conditions. The trend of the time–frequency ridges in the TFI reflects the changes in the signal’s instantaneous frequency over time, while different forms of time–frequency ridges display the intra-pulse characteristics of different signals. Traditional TFA is susceptible to noise interference, and under low SNR conditions, the time–frequency ridges of the signal in the TFI can easily become distorted. This blurs the intra-pulse characteristics and texture features of the signal in the time–frequency domain, thereby affecting recognition.
- 3.
Recent works [22,23,24] have utilized advanced networks to denoise and reconstruct features of the TFIs, thereby obtaining clear TFIs. These denoising networks typically operate on the transformed noisy time–frequency images. Although this approach improves accuracy compared to the traditional TFI-CNN method, the subsequent recognition network does not fully utilize the original time-domain information throughout the recognition process, but only works on the denoised TFI. Additionally, under low SNR conditions, some signals often fail to exhibit complete intra-pulse characteristics in the time–frequency domain due to noise interference. This can lead to confusion between signal intra-pulse characteristics and noise, causing the denoising network to misinterpret the signal.
- 4.
Most current work uses deep CNNs to classify radar signal TFIs. Convolutional networks are local and translation invariant, which allows convolution operations to learn edges and higher-level local features of objects in images. However, the locality of convolution operations also limits their ability to learn global image features and signal location information, which are critical for obtaining satisfactory recognition results at low SNR.
Therefore, we propose the TFGM-RMNet dual-component radar signal recognition framework. Unlike the previous TFI-CNN multi-component recognition scheme, TFGM-RMNet automatically learns to generate time–frequency representations (TFRs) for multi-component radar signals through the TFGM module, effectively alleviating the low quality of TFIs generated by traditional TFA under low SNR conditions. Specifically, the framework mainly consists of a deep time–frequency generation module and a classification module. The noisy signal is first preprocessed and then used as the input sample for TFGM, with the noise-free TFI as the learning target. During this process, the TFGM module guides the network weights to adaptively learn basis functions, obtains various TF features, and reconstructs them to generate the TFR. Owing to the supervision from the clean TFI, this also endows the TFGM module with automatic denoising capability. Finally, RMNet, which combines ResNet and Transformer learning, fully extracts local and global features from the TFR and ultimately outputs category predictions. The main contributions of this paper can be summarized as follows:
- 1.
For dual-component modulation recognition, a TFGM-RMNet network is proposed, which embeds the deep learning-based TFA module TFGM into the recognition network to replace the traditional TFA. In the testing phase, the end-to-end TFGM-RMNet directly generates the recognition results, avoiding the step-by-step operation in the traditional method and eliminating the need to design a denoising network, thereby achieving a performance improvement compared to the traditional multi-class radar signal recognition scheme.
- 2.
TFGM consists of reduction, encoder, and decoder, which are responsible for frequency domain feature mapping, feature extraction, and reconstruction, respectively. The reduction module extracts the frequency domain features of the time domain signal by adaptively learning the basis function through convolution weights. The encoder and decoder aggregate the time–frequency features to generate TFR. In order to improve the quality of the generated TFR, we use mean square error (MSE) loss and perceptual loss for training to improve the pixel and structural similarity of TFR.
- 3.
The classification network of our model adopts a hybrid design combining local convolution and global self-attention. Unlike the vision transformer (ViT), which divides images into patches as input, our model replaces convolution with a cascaded multi-head self-attention (MHSA) layer to achieve global self-attention on the convolved 2D feature map, alleviating the lack of global modeling in the hidden layers of the CNN that affects recognition accuracy.
- 4.
We conduct extensive experiments to verify the effectiveness of the proposed framework and compare the recognition accuracy of our model with existing work. Experimental results show that our model has good recognition performance under low SNR.
The remainder of this paper is organized as follows. Section 2 introduces the mathematical model of multi-component radar signals and the reassignment SPWVD (RSPWVD) algorithm. Section 3 describes the proposed TFGM-RMNet framework and explains each module in detail. Section 4 presents the experimental dataset, implementation, and results. Section 5 provides a discussion. Finally, Section 6 concludes this paper.
3. Methodology
Our dual-component radar signal recognition model, denoted TFGM-RMNet, utilizes a TFGM module in place of the traditional TFT method to generate TF features in the time–frequency domain. It employs RMNet as the feature extraction module to extract local and global features from the TFR generated by TFGM. During the training phase, we first preprocess the input noisy radar signal by normalization. Subsequently, the network weights learned by TFGM serve as basis functions to extract signal TF features, and the time–frequency domain features are reconstructed to generate a clean TFR. In this process, we use RSPWVD to generate time–frequency images of the clean dual-component radar signals, which serve as labels to facilitate the learning of the TFGM module. Finally, RMNet is used to extract deep features from the TFR. We use multiple loss functions to evaluate this multi-task learning model, thereby improving the model's time–frequency energy aggregation capability and classification performance. During the testing phase, no TFT tools are needed, achieving a direct end-to-end process from input time-domain signals to output classification results.
3.1. Framework
Our dual-component radar signal recognition framework is shown in Figure 2. It consists of a time–frequency feature generation module and a multi-label modulation recognition module.
The time–frequency feature generation module consists of a reduction module, an encoder, a decoder, and an upsampling layer. It extracts features from the original time-domain signal and outputs the TFR of the signal. With the TFGM module, there is no need to explicitly generate and process time–frequency images during recognition: the mapping signal (noisy) → TFR (clean) is implemented directly, without passing through the signal (noisy) → TFI (noisy) → TFI (denoised) → TFR (clean) pipeline of existing work. The resulting TFR vector can be output directly to the next recognition module, forming an end-to-end recognition process.
The multi-label modulation recognition module consists of RMNet and a multi-label classification layer. RMNet uses the locality and translation invariance of convolution to extract the underlying features of the input image, enhancing the generalization ability of the overall model and reducing the dependence of the subsequent attention module on the amount of data. The cascaded multi-head self-attention mechanism based on the Transformer improves the capture of global contextual information in the image features. The multi-label classification layer maps the extracted features to each possible modulation category and outputs the predicted probability of each category. The predicted probability of each label is mapped to the [0, 1] interval using the sigmoid function, indicating the probability of each category. By setting an appropriate threshold, the presence of each label can be judged, thereby achieving multi-label classification and the recognition of modulated signals, as sketched below.
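As a minimal illustration of this classification head, the following PyTorch sketch shows how per-class logits can be mapped to independent probabilities and thresholded into a multi-label decision; the feature dimension, class count, and the 0.5 threshold are illustrative assumptions rather than the paper's actual settings.

```python
import torch
import torch.nn as nn

class MultiLabelHead(nn.Module):
    """Maps pooled RMNet features to one logit per modulation class."""
    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(in_features, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.fc(feats)          # raw logits, one per class

head = MultiLabelHead(in_features=2048, num_classes=8)   # 2048 = assumed pooled feature width
logits = head(torch.randn(4, 2048))                      # a batch of pooled features
probs = torch.sigmoid(logits)                            # each class mapped to [0, 1]
preds = (probs > 0.5).int()                              # thresholding -> multi-label decision
```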
3.2. Time–Frequency Representation Generation
We constructed the TFGM module in the dual-component modulation recognition framework. TFGM learns the TFT through its reduction, encoder, and decoder so as to generate clean TFRs. Figure 3 shows the reduction module, which illustrates how the convolution operation, by analogy with the STFT, learns Fourier-like bases so that convolution can extract frequency-domain features from the time-domain signal. Figure 4 illustrates the encoder and decoder, whose functions are to aggregate and reconstruct the time–frequency feature maps extracted by the reduction module [30].
3.2.1. Reduction
The convolution operation in a neural network maps the local values within the receptive field to a point in the feature map, whereas in the Fourier transform, each frequency value $\omega$ corresponds to the global time-domain information of the signal $x(t)$ over the time axis from $-\infty$ to $+\infty$, as shown in the following Equation (5). The Fourier transform can also be interpreted as a coordinate transformation in Hilbert space by selecting a set of orthogonal bases, thus achieving the conversion from the time domain to the frequency domain.

$$X(\omega) = \int_{-\infty}^{+\infty} x(t)\, e^{-j\omega t}\, dt \quad (5)$$

where $t$ is the time variable, $\omega$ is the frequency variable, and $X(\omega)$ is the frequency representation of $x(t)$.
It can be seen that the convolution operation in neural networks and the Fourier transform share a certain similarity in their mapping approaches. To enable the Fourier transform to map only local signals, a window function is introduced to obtain the STFT:

$$STFT(t, \omega) = \int_{-\infty}^{+\infty} x(\tau)\, g(\tau - t)\, e^{-j\omega \tau}\, d\tau \quad (6)$$

where $g(\tau - t)$ is the sliding window function.
The STFT analyzes the local time–frequency characteristics of a signal by sliding a window function over the time-domain signal, which is similar to the movement of the receptive field in neural networks. Its discrete form, with frame index $m$, frequency-bin index $k$, hop size $R$, and window length $L$, is

$$STFT(m, k) = \sum_{n=0}^{L-1} x(n + mR)\, g(n)\, e^{-j 2\pi k n / L} \quad (7)$$

The convolution operation without a bias term is given by:

$$y(m) = \sum_{n=0}^{L-1} x_m(n)\, w(n) \quad (8)$$

where $L$ is the length of the convolution kernel, $x_m(n)$ is the $m$th segment of the input signal, and $w(n)$ is the kernel weight. Based on the use of the window function $g(\cdot)$ in the STFT to extract a portion of the original signal, the segmented signal $x_m(n)$ can also be written as $x(n + mR)\, g(n)$. Therefore, Equation (8) can also be described as:

$$y(m) = \sum_{n=0}^{L-1} x(n + mR)\, g(n)\, w(n) \quad (9)$$

The above convolution formula is in fact the general form of the discrete STFT with a window length equal to the one-dimensional convolution kernel length $L$: when the kernel weights $w(n)$ take the form of complex exponential bases, Equation (9) reduces to Equation (7). From Equations (6), (7) and (9), we can see that the convolution operation updates its weight matrix to learn the basis of the transformation, thereby extracting TF features from the time-domain signal.
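The following NumPy sketch illustrates this analogy numerically: a single discrete STFT coefficient equals the output of a convolution whose kernel is a window-weighted complex exponential. The window choice, kernel length, and test signal are arbitrary illustrative values; in TFGM the kernel is not fixed but learned as a convolution weight.

```python
import numpy as np

L = 64                                          # window / kernel length
n = np.arange(L)
g = np.hanning(L)                               # sliding window g(n)
x = np.exp(2j * np.pi * 0.1 * np.arange(1024))  # toy single-tone test signal

# Kernel for frequency bin k: a window-weighted complex exponential.
k = 10
kernel = g * np.exp(-2j * np.pi * k * n / L)

segment = x[128:128 + L]                        # one windowed segment of the signal
stft_coeff = np.sum(segment * kernel)           # inner product = one STFT coefficient
conv_coeff = np.convolve(x, kernel[::-1], mode="valid")[128]  # same value via convolution
assert np.allclose(stft_coeff, conv_coeff)
```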
Initially, the input signals are preprocessed. The model takes 1 × 1024-dimensional complex signals as input and pads signals shorter than 1024 with zeros. Considering the large variation in the amplitude of different signals, normalization is applied to prevent the saturation of feature maps in early layers and accelerate convergence.
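A possible preprocessing routine is sketched below; the paper specifies zero-padding to 1024 samples and normalization, while the particular normalization (maximum-amplitude scaling) and the stacked I/Q output layout are assumptions.

```python
import numpy as np

def preprocess(iq: np.ndarray, target_len: int = 1024) -> np.ndarray:
    """Zero-pad (or truncate) a complex I/Q pulse and normalize its amplitude."""
    iq = iq[:target_len]
    if iq.size < target_len:
        iq = np.pad(iq, (0, target_len - iq.size))   # zero-padding short pulses
    iq = iq / (np.max(np.abs(iq)) + 1e-12)           # amplitude normalization (assumed form)
    return np.stack([iq.real, iq.imag])              # 2 x 1024 real/imaginary channels
```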
The reduction module maps the time-domain signal to TF features. The input preprocessed signal is split into I/Q data to form the complex signal $x = x_I + j x_Q$. We use the real part $W_r$ and imaginary part $W_i$ of the complex-valued one-dimensional convolution kernel $W = W_r + j W_i$ to perform a moving convolution on the I/Q complex signal. According to Equation (8), the complex-valued convolution can be written as:

$$y = (x_I * W_r - x_Q * W_i) + j\,(x_I * W_i + x_Q * W_r)$$

where $x_I * W_r - x_Q * W_i$ and $x_I * W_i + x_Q * W_r$ are the real and imaginary parts, respectively, and $*$ denotes the one-dimensional convolution. The convolution weight matrix $W$ is continuously updated through this complex-valued convolution process, guiding the network weights to adaptively learn the basis functions and ultimately extracting frequency features from the time-domain signal.
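A hedged PyTorch sketch of such a complex-valued one-dimensional convolution is given below; the kernel length, stride, and number of output channels are illustrative choices, not values taken from the paper.

```python
import torch
import torch.nn as nn

class ComplexConv1d(nn.Module):
    """Complex-valued 1D convolution: (x_I + j x_Q) convolved with (W_r + j W_i)."""
    def __init__(self, out_channels: int = 256, kernel_size: int = 64, stride: int = 8):
        super().__init__()
        self.conv_r = nn.Conv1d(1, out_channels, kernel_size, stride, bias=False)  # W_r
        self.conv_i = nn.Conv1d(1, out_channels, kernel_size, stride, bias=False)  # W_i

    def forward(self, x_r: torch.Tensor, x_i: torch.Tensor):
        y_r = self.conv_r(x_r) - self.conv_i(x_i)   # real part of the complex convolution
        y_i = self.conv_r(x_i) + self.conv_i(x_r)   # imaginary part
        return y_r, y_i

x = torch.randn(4, 2, 1024)                      # batch of preprocessed I/Q signals
y_r, y_i = ComplexConv1d()(x[:, :1], x[:, 1:])
tf_feature = torch.sqrt(y_r ** 2 + y_i ** 2)     # magnitude map passed on as a TF feature
```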
3.2.2. Encoder and Decoder
The encoder–decoder network integrates and aggregates the TF features extracted by the reduction module to obtain an aggregated TFR. The encoder, placed at the front of the network as a feature extractor, consists of two MBConv [31] blocks and several concatenated 2D convolutions. This structure aims to preserve the main components of the TF features in the feature map while suppressing noise. The reduction module outputs the TF feature maps. MBConv uses convolutions with an expansion ratio of 1, together with pointwise (1 × 1) convolutions and depthwise convolutions, to extract features from these maps and to better capture and utilize their spatial and channel correlations. Each convolution layer in MBConv is followed by Leaky ReLU activation and batch normalization to introduce nonlinearity and enhance feature stability. The resulting features then enter a series of concatenated 2D convolutional layers. Each convolution can be seen as a further abstraction and refinement of the TF features from the previous layer, allowing the model to capture key features of different scales and complexities. A ReLU activation follows each convolutional layer, setting negative values to zero and preserving positive values to enhance the network's nonlinearity for learning complex data patterns and features.
Specifically, the deep feature extraction of the encoder consists of 2 MBConv layers followed by a stack of convolutional layers, and the feature extraction process can be expressed as:

$$F_i = \mathrm{ReLU}(W_i * F_{i-1} + b_i)$$

where $W_i$ and $b_i$ represent the weight and bias parameters, and $F_i$ is the output feature map of the $i$-th layer of the encoder. The decoder mirrors this process with deconvolutional (transpose convolution) layers:

$$D_i = \mathrm{ReLU}(W_i^{d} \circledast D_{i-1} + b_i^{d})$$

where $D_0$ is the input feature map of the decoder, which comes from the output of the last encoder layer, $D_i$ represents the output feature map of the $i$-th layer of the decoder, $W_i^{d}$ and $b_i^{d}$ are the decoder weights and biases, $\circledast$ denotes transpose convolution, and every two deconvolutional layers receive a skip connection from the encoder. Finally, all the feature maps are aggregated and upsampled to the original spatial resolution using transpose convolutional layers, yielding the TFR output, as shown in Figure 4.
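The following is a deliberately condensed PyTorch sketch of an encoder-decoder with one skip connection and transpose-convolution upsampling, in the spirit of the structure described above; the actual TFGM encoder uses MBConv blocks and more layers, and all channel counts and depths here are assumptions.

```python
import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    """Toy encoder-decoder: downsample, upsample, and fuse via one skip connection."""
    def __init__(self, c_in: int = 2, c_mid: int = 32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(c_in, c_mid, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(c_mid, c_mid, 3, stride=2, padding=1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(c_mid, c_mid, 2, stride=2), nn.ReLU())
        self.dec2 = nn.Conv2d(c_mid, 1, 3, padding=1)   # single-channel TFR output

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        d1 = self.dec1(e2) + e1                          # skip connection from the encoder
        return self.dec2(d1)

tfr = TinyEncoderDecoder()(torch.randn(4, 2, 128, 128))  # -> (4, 1, 128, 128)
```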
3.3. Multi-Label Modulation Identification
RMNet adopts a structure that combines ResNet with cascaded multi-head self-attention, leveraging the strengths of both convolutional operations and self-attention mechanisms to achieve complementary advantages. Convolutional operations have limited receptive fields and translation invariance properties, which can lead convolutional networks to focus on local features while neglecting global context [32]. In contrast, the cascaded multi-head self-attention mechanism can connect information from any position, enabling the effective computation of long-range sequences and capturing dependencies between features across the entire global context. However, it has relatively poor modeling capabilities for 2D local data.
In the ResNet50 backbone network, layer1 and layer2 mainly extract low-level features, such as edges and textures. At this stage, local information in the feature maps is more critical, and traditional convolutional operations are better suited for extracting these features. The deeper layer4 mainly extracts high-level features, with a lower resolution of feature maps. Meanwhile, the mid-layer layer3 contains both local information and some global context information. Integrating cascaded multi-head self-attention modules into layer3 can effectively capture global dependencies and enhance the ability to model complex patterns. Specifically, we use six cascaded multi-head self-attention modules in layer3 to replace the traditional 3 × 3 spatial convolution. This design allows local features to be enhanced through convolutional operations while global dependencies are modeled through the self-attention mechanism, achieving a comprehensive analysis of features. The structure of RMNet is shown in Figure 5.
Figure 6 shows the specific framework of the cascaded multi-head attention. The input feature map is $X \in \mathbb{R}^{C \times H \times W}$, where $H$ and $W$ are the height and width, respectively. Through learned linear transformations, the input $X$ is mapped to the query $Q$, key $K$, and value $V$ [33]. The dot product of $Q$ and $K$ computes the attention score matrix $QK^{T}$. To stabilize gradients, a scaling factor $1/\sqrt{d_k}$ is introduced, where $d_k$ is the dimension of each head. This scaling factor helps prevent excessively large scores, making the gradient of the Softmax function more stable. The scaled dot-product matrix is then input to the Softmax function to calculate the attention activation map $A$. Finally, by multiplying the attention activation map $A$ with the values $V$, the final output $Z$ is obtained. The specific process is as follows:

$$A = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right), \qquad Z = A\,V$$
To further integrate the attention output features, a convolutional layer is added to perform feature transformation while preserving the spatial structure. Then, a skip connection adds the input feature map $X$ to the output after convolution and Dropout, helping the model retain shallow features while learning new ones, merging new and old features, enhancing the model's representational capability, and mitigating the vanishing gradient problem. Finally, layer normalization is applied after the skip connection to avoid excessive differences in feature distributions between different sub-layers and to further stabilize gradient flow. The specific process is shown below:

$$Y = \mathrm{LN}\big(X + \mathrm{Dropout}(\mathrm{Conv}(Z))\big)$$

where $\mathrm{Dropout}(\cdot)$ is the dropout operation and $\mathrm{LN}(\cdot)$ refers to the layer normalization operation.
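A PyTorch sketch of one such self-attention block on a 2D feature map is given below, following the steps above (linear Q/K/V projections, scaled dot-product attention, a convolutional feature transformation, Dropout, a skip connection, and layer normalization); the channel width, head count, and dropout rate are assumptions.

```python
import torch
import torch.nn as nn

class MHSABlock(nn.Module):
    """Global self-attention on a 2D feature map, followed by conv, dropout, skip, and LN."""
    def __init__(self, channels: int = 1024, num_heads: int = 4, p_drop: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)   # feature transformation
        self.drop = nn.Dropout(p_drop)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)            # (B, H*W, C): every position is a token
        z, _ = self.attn(seq, seq, seq)               # scaled dot-product attention, Z = A V
        z = z.transpose(1, 2).reshape(b, c, h, w)
        z = self.drop(self.proj(z))
        y = x + z                                     # skip connection keeps shallow features
        y = self.norm(y.flatten(2).transpose(1, 2))   # layer normalization over channels
        return y.transpose(1, 2).reshape(b, c, h, w)

out = MHSABlock(channels=64, num_heads=4)(torch.randn(2, 64, 14, 14))
```

In RMNet, blocks of this kind would stand in for the 3 × 3 spatial convolutions of layer3 in the ResNet50 backbone, so the convolutional stem still supplies local features while the attention layers model global dependencies.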
3.4. Training
Our recognition framework is a multi-task learning model that trains corresponding tasks using different loss functions. To enable the TFGM module to generate high-quality TFRs, the choice of loss function is crucial for achieving good generation performance. When the TFGM learns to generate TFRs from time-domain waveforms, it transforms a one-dimensional signal matrix into a two-dimensional image matrix. In this process, the TF features occupy only a small portion of the generated two-dimensional TFR matrix, making the effective feature regions sparse within the entire TFR. Therefore, a pixel-level reconstruction loss is necessary to address this issue. The pixel-wise MSE loss is calculated as follows [34]:

$$L_{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^{2} \quad (19)$$

where $N$ is the total number of pixels, and $y_i$ and $\hat{y}_i$ represent the gray values of the $i$-th pixel in the label time–frequency image and the generated time–frequency image, respectively.
However, the MSE loss function focuses only on pixel-level differences and is insufficient for capturing the details and linear structures of the target signal in the TFR. The shape and contour information of the signal in the TFR are critical features for the subsequent recognition network. To make the generated TFR closer to the ideal TFI, we introduce a perceptual loss that measures the similarity between the generated image and the target image by comparing their feature representations in the intermediate layers of a pre-trained neural network. We employ the pre-trained VGG16 network for feature extraction. For the $j$-th layer of the VGG16 network, the image loss can be expressed as:

$$\ell_{j} = \frac{1}{C_j H_j W_j}\left\| \phi_j(\hat{y}) - \phi_j(y) \right\|_{2}^{2} \quad (20)$$

where $\phi$ represents the VGG16 network, $j$ is the layer index, $\phi_j(\hat{y})$ and $\phi_j(y)$ are the features of the generated and target images extracted by the $j$-th layer of the VGG16 network, and $C_j$, $H_j$, $W_j$ represent the channels, height, and width of the features of the $j$-th layer, respectively.
The overall perceptual loss is obtained by summing the losses over all layers, represented as [35]:

$$L_{perc} = \sum_{j=1}^{N} \ell_{j} \quad (21)$$

where $N$ is the total number of layers in the VGG16 network.
For the multi-label modulation recognition task, we use the binary cross-entropy loss with logits to compute the loss between the predicted labels and the given labels. It combines the Sigmoid layer and the binary cross-entropy loss into one component, which is numerically more stable than applying a Sigmoid followed by a separate binary cross-entropy loss. The loss function is described as follows:

$$L_{cls} = -\frac{1}{C}\sum_{c=1}^{C} w_c \left[ y_c \log \sigma(x_c) + (1 - y_c) \log\big(1 - \sigma(x_c)\big) \right] \quad (22)$$

where $w_c$ is the weight of class $c$, $y_c$ is the true label of the sample, $x_c$ is the logit output by the model, and $\sigma(x_c)$ is the sample probability output.
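The three losses can be combined as in the following sketch; the VGG16 feature layers used for the perceptual loss and the loss weights are assumptions, since the paper names the loss types but does not fix these details here.

```python
import torch
import torch.nn as nn
import torchvision

mse_loss = nn.MSELoss()              # pixel-wise reconstruction loss, Equation (19)
bce_loss = nn.BCEWithLogitsLoss()    # multi-label classification loss, Equation (22)

# Frozen pre-trained VGG16 feature extractor for the perceptual loss.
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(gen_tfr: torch.Tensor, target_tfi: torch.Tensor) -> torch.Tensor:
    """Compare intermediate VGG16 features of the generated TFR and the label TFI."""
    loss = 0.0
    x = gen_tfr.repeat(1, 3, 1, 1)      # single-channel TFR -> 3-channel VGG input
    y = target_tfi.repeat(1, 3, 1, 1)
    for i, layer in enumerate(vgg):
        x, y = layer(x), layer(y)
        if i in {3, 8, 15, 22}:         # ReLU outputs of selected conv blocks (assumed)
            loss = loss + nn.functional.mse_loss(x, y)
    return loss

def total_loss(gen_tfr, target_tfi, logits, labels, lam=(1.0, 0.1, 1.0)):
    """Weighted sum of the three objectives; `labels` must be a float multi-hot tensor."""
    return (lam[0] * mse_loss(gen_tfr, target_tfi)
            + lam[1] * perceptual_loss(gen_tfr, target_tfi)
            + lam[2] * bce_loss(logits, labels))
```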
Figure 7 illustrates the training process of TFGM-RMNet. The TFGM-RMNet recognition framework integrates the TFR generation task and the modulation recognition task. On the one hand, the input time-domain signal passes through the TFGM module, and $L_{MSE}$ and $L_{perc}$ are calculated using Equations (19) and (21), respectively, to supervise the generated TFR. On the other hand, the TFR output by the TFGM module is processed by the RMNet feature extraction network to obtain the corresponding feature values, which are then used in Equation (22) to calculate $L_{cls}$, resulting in the final classification result Out2. During testing, only forward propagation is executed, outputting the classification results and the expected TFR.