Article

Specific Emitter Identification Algorithm Based on Time–Frequency Sequence Multimodal Feature Fusion Network

Yuxuan He, Kunda Wang, Qicheng Song, Huixin Li and Bozhi Zhang
1 School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
2 School of Integrated Circuits and Electronics, Beijing Institute of Technology, Beijing 100081, China
3 Key Laboratory of Dynamics and Control of Flight Vehicle, Beijing Institute of Technology, Beijing 100081, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(18), 3703; https://doi.org/10.3390/electronics13183703
Submission received: 16 August 2024 / Revised: 5 September 2024 / Accepted: 15 September 2024 / Published: 18 September 2024
(This article belongs to the Special Issue Machine Learning for Radar and Communication Signal Processing)

Abstract

Specific emitter identification is a challenging problem in radar signal processing that aims to extract the individual fingerprint features of a signal. However, earlier works are designed around either the raw signal or the time–frequency image and rely heavily on the calculation of hand-crafted features or on complex interactions in a high-dimensional feature space. This paper introduces the time–frequency multimodal feature fusion network, a novel architecture based on multimodal feature interaction. Specifically, we design a time–frequency signal feature encoding module, a wvd image feature encoding module, and a multimodal feature fusion module. Additionally, we propose a feature point filtering mechanism named FMM for signal embedding. Our algorithm demonstrates high performance in comparison with state-of-the-art mainstream identification methods. The results indicate that our algorithm outperforms the others, achieving the highest accuracy, precision, recall, and F1-score and surpassing the second-best method by 9.3%, 8.2%, 9.2%, and 9%, respectively. Notably, the visual results show that the proposed method aligns with the signal generation mechanism, effectively capturing the distinctive fingerprint features of radar data. This paper establishes a foundational architecture for subsequent multimodal research on SEI tasks.

1. Introduction

Specific emitter identification (SEI) is a research focus in electronic reconnaissance and radar signal processing, playing a crucial role in modern communication systems [1,2,3]. Its primary objective is to identify the unique, fine-grained features within emitter signals that indicate distinct emitter identities. Because of their uniqueness and stability [4,5,6], we refer to them as fingerprint features. Specifically, variations in hardware, particularly unintentional phase modulation caused by oscillators, constitute the primary origin of individual fingerprint features [7]. The pipeline of the SEI task consists of the following steps: (1) Analyze the generation mechanisms of fingerprint features. (2) Investigate feature extraction methods or calculate the characteristic values of the signal. (3) Design the identification network model. (4) Ensure that the extracted fingerprint features reflect inter-emitter differences, avoiding similarity and loss of distinction.
However, present mainstream SEI technologies face numerous limitations. For instance, Zhou et al. [8] extracted BRT features from the signal, processed the features through a DAE and two layers of RBM, and ultimately classified them with a logistic regression layer. Similarly, another study [9] introduced an individual identification method combining the SST with a SAN. Methods based on hand-crafted features require complex computations, including the contour integral of the bispectrum, waveform entropy, and energy entropy of signals [10]. These methods are extremely complicated and may not produce reliable identification performance due to the unstable representation of hand-crafted features.
In contrast to traditional methods, deep learning methods can encode features into a higher-dimensional subspace, which provides stronger feature expression capacity and is better suited to extracting fingerprint features. Deep-learning-based SEI frameworks can be grouped into three categories: methods employing a single-modal signal, methods using a single-modal image, and approaches employing multimodal feature interaction. Firstly, methods employing a single-modal signal usually combine different signal processing techniques, such as the FFT and the wavelet transform. As an example, Zhu et al. [11] utilized single-modal signals and incorporated a range of digital signal processing technologies; the different signal transformation features were fused in a subspace, achieving remarkable performance. Secondly, methods based on a single-modal image typically use a time–frequency transformation to generate images, converting the signal classification problem into an image classification task. Although identification performance has been improved by both approaches, the resulting models remain susceptible to external factors, leading to suboptimal performance in real-world scenarios. In addition, these methods are difficult to interpret, and it is hard to determine whether they align with the mechanism of individual fingerprint feature generation.
Multimodal fusion refers to the fusion and alignment of data from different domains and has been a major research direction in deep learning, especially in target detection and tracking. Multimodal approaches can reduce the error and uncertainty that come with a single data source, ultimately improving accuracy in identification tasks [11,12,13]. However, few studies have applied them to SEI tasks. The study in [11] is a representative example. It first used a CNN to extract the time-sequence feature. Then, various transformations of the time sequence were processed for feature extraction by a parameter-sharing CNN. After that, these features were interacted and fused through a subspace interactive mutual unit, yielding a cross-correlation matrix. However, this mutual unit included operations such as adding data of different dimensions and maximum pooling, so the physical meaning of the cross-correlation matrix was relatively abstract. In conclusion, this approach relied on high-order feature interactions in the subspace, and the correlations between these features were not well explained. The encoder–decoder structure of the Transformer provides a new route to multimodal feature fusion.
To solve the aforementioned problems, we design a time–frequency multimodal fusion network in this paper. The core idea of our algorithm is to encode and decode different combinations of time–frequency signals to achieve multimodal alignment. Figure 1 depicts the time signal waveform, the frequency signal waveform, and the key regions of the wvd images for ten categories. Essentially, the hardware differences of the oscillator are the main source of individual fingerprint features. These differences are mainly reflected in the start and end of the impulse response, as shown in the red circle of the time signal. However, since the differences are so small, they are easily drowned out by noise. Wigner–Ville distribution (wvd) images provide a clear representation of how the frequency components of a signal evolve over time: the horizontal axis of the wvd image represents time, and the vertical axis represents frequency. However, the information in wvd images is relatively concentrated. This paper therefore focuses on extracting the key areas from the wvd while disregarding regions with insignificant feature information. As with the time signal, the areas with high differentiation are located at the beginning and end of the image. For example, as illustrated in Figure 1, there is an upward trend at the end of the fifth category, a slight decline at both ends of the tenth category, and a frequency turn at the beginning of the sixth category.
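For readers unfamiliar with the transform, the following is a minimal NumPy sketch of the discrete Wigner–Ville distribution of an analytic pulse. The function name `wvd` and the unscaled frequency axis are illustrative choices, not the paper's implementation.

```python
import numpy as np

def wvd(x):
    """Minimal discrete Wigner-Ville distribution sketch.
    x: 1-D complex (analytic) pulse of length N.
    Returns an (N, N) real time-frequency matrix; bin k corresponds to
    normalized frequency k / (2N), since the lag variable steps by 2 samples."""
    N = len(x)
    W = np.zeros((N, N), dtype=complex)
    for n in range(N):
        tau_max = min(n, N - 1 - n)                # largest symmetric lag at sample n
        taus = np.arange(-tau_max, tau_max + 1)
        r = x[n + taus] * np.conj(x[n - taus])     # instantaneous autocorrelation
        kernel = np.zeros(N, dtype=complex)
        kernel[taus % N] = r                       # wrap negative lags for the FFT
        W[n, :] = np.fft.fft(kernel)               # DFT over the lag variable
    return W.real
```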
To sum up, this paper introduces a new identification architecture named the time–frequency multimodal feature fusion network. It leverages a TCN [14] to extract signal features and proposes a filtering mechanism based on the maximum (FMM) for feature selection. In the image domain, the features of wvd images are extracted by the sparse attention mechanism proposed in [15] to improve computational efficiency and performance. After feature extraction, a cross-attention mechanism is used to integrate the multimodal features. The main contributions of this research are as follows:
  • We designed a novel network structure based on multimodal feature fusion. It is an end-to-end identification architecture that requires no hand-crafted features. This approach overcomes the limitations of single-modal identification and improves identification performance.
  • We proposed a time–frequency signal encoding method and a wvd image feature encoding method. The FMM in signal encoding presents an innovative technique to filter important features and optimize feature alignment.
  • We developed an alignment strategy for signal and wvd features, presenting a fresh perspective on multimodal alignment within signal analysis.

2. Related Works

Currently, there are two primary methods for radar emitter identification: one involves hand-crafted feature extraction combined with classifiers, while the other employs deep learning for identification [16].

2.1. Hand-Crafted Feature Methods

In the past, researchers commonly used waveform curve features combined with template matching for identification, as mentioned in the literature [17,18,19]. The waveform matching method is only applicable to a few radar signals, so researchers started to explore new methods. Subsequently, high-order statistics methods were introduced into SEI. For example, Pei et al. [19] extracted several individual features from the signal, such as radio frequency and pulse width, which were utilized for classification using the k-nearest neighbor technique. Zhang et al. [20] employed a nonparametric technique to estimate the differences in various characteristics of the bispectrum, such as its amplitude spectrum, diagonal slices, energy amplitude, and frequency. However, these methods were found to be complex, unreliable, and only applicable to a limited subset of radar signals, lacking general applicability. Consequently, methods that depend solely on hand-crafted features are gradually being phased out and replaced by approaches that incorporate deep learning.
To summarize, the main disadvantages of methods that rely on hand-crafted features are their high computational requirements and slow processing speeds. Additionally, these features cannot represent universal radar emitters and are overly sensitive to noise, ultimately resulting in subpar generalization performance. Deep learning utilizes neural networks to automatically extract high-dimensional features from signals or images, eliminating the need for manual feature design and becoming the prevailing approach for identification.

2.2. Deep Learning Methods

Deep learning has made significant advancements in various application fields, and its integration with radar emitter identification has shown superior efficiency [21]. For example, Shan et al. [22] adopted an RBM/AE network to compress the original sequence and obtain a low-dimensional feature representation of the signal. However, training such a network is challenging: it requires a large amount of data, is highly sensitive to the input data, and relies heavily on data preprocessing [23]. Xu et al. [24] took a different approach, using the signal envelope as the input to a deep belief network to extract fingerprint features. However, their approach only analyzed raw signal sequences, which led to suboptimal identification results at low SNR.
At the same time, some scholars chose to extract certain manual features to guide model learning. Zhou et al. [8] commenced by extracting the BRT characteristics from signals, which were then fed into a DAE. However, the BRT feature captures some accidental changes in the signal, resulting in insufficient generalization ability of the model. In [9], the SST feature was introduced: the signal was processed using the synchrosqueezing transform and then classified using the SAN model. Although good performance was achieved, the SST feature is easily affected by noise.
Both the time and frequency domains contain unique fingerprint features [25,26]. Based on this, Ru et al. [27,28] extracted and classified the distribution density, Euclidean distance, cross-correlation, and skewness of the signal spectrum and achieved good results. An alternative approach discussed in the literature [29,30,31] involves applying the short-time Fourier transform to generate time–frequency images. However, it introduces a new challenge: the reliance on the quality of image datasets.
All in all, deep-learning-based methods have made significant strides in performance, but there is still room for improvement. Rather than focusing solely on network module design, multimodal methods have a more positive effect on performance.

2.3. Multimodal Methods

In the realm of radar signal identification, there is a need for a more comprehensive multimodal feature fusion architecture. Previous work by Zhang et al. [32] designed a dual-branch network, where one branch processes signal features and the other extracts image features; the feature vectors from the two branches are then concatenated for the final classification. Although this approach shows some improvement in performance, it lacks interpretability and fails to achieve alignment of the two modal features.
Currently, the majority of research on multimodal fusion concentrates on integrating visual imagery with semantic information. In this context, many studies leverage the encoder–decoder architecture of the Transformer model. The core of effective multimodal fusion lies in the design of the query in the cross-attention mechanism. The query itself is a sequence of features that can be designed according to the application scenario. By extracting features from the signal and embedding them into the query vector, we can then use the vector to decode the features of the wvd image. This process facilitates the alignment of features across modalities.

3. Research Methodology

A brief overview of our algorithm is depicted in Figure 2, with the detailed network structure described below. The upper branch utilizes a TCN to process time–frequency signals, employs FMM to select key feature points, and adopts a self-attention mechanism for feature encoding [33]; it is responsible for feature extraction, signal embedding, and obtaining the query vector. The lower branch handles image data through patch embedding and a sparse attention mechanism for image feature encoding. Subsequently, features from the two modalities are fused through a cross-attention mechanism, and classification is carried out by a multilayer perceptron.

3.1. Signal Feature Extraction and Embedding

The first stage involves extracting features from the original radar signal and filtering them to acquire the query vector, as illustrated in Figure 3. The time signal is the initial sequence, and its frequency sequence obtained by the Fourier transform [34] can be considered an additional feature. By combining these two sequences along the feature dimension, we obtain more comprehensive signal feature information; this represents the first interaction of the time–frequency signals. The process produces the signal feature vector $s \in \mathbb{R}^{len \times 2}$, where $len$ denotes the length of the signal sequence and 2 represents the time and frequency feature dimensions.
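As a minimal sketch of this first interaction (the function name and the use of the FFT magnitude are our illustrative assumptions, not the paper's exact preprocessing):

```python
import numpy as np

def time_frequency_embedding(x):
    """Concatenate the time sequence and its spectrum along the feature dimension.
    x: (len,) radar pulse. Returns s with shape (len, 2)."""
    spectrum = np.abs(np.fft.fft(x))              # frequency-domain sequence, same length as x
    return np.stack([np.real(x), spectrum], -1)   # s in R^{len x 2}: [time feature, frequency feature]
```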
Then, the TCN is employed to capture causal relationships in radar signals. Using causal convolutions, the TCN ensures the causality of the output sequence, which is essential for signal analysis. Additionally, the TCN employs dilated convolutions to expand the receptive field of the feature points, enabling the network to capture long-range dependencies in sequences. Moreover, the residual structure of the TCN helps maintain stable gradients during training. Compared to traditional networks such as CNNs and RNNs, the TCN is better suited to extracting causal signal features in real-world scenarios, reducing information loss. The computation of every feature point within the sequence generated by the TCN can be described using Equation (1).
$$F(s) = (x *_{d} f)(s) = \sum_{i=0}^{k-1} f(i) \cdot x_{s - d \cdot i} \tag{1}$$
$$o = \mathrm{Activation}\big(x + F(x)\big) \tag{2}$$
where $F$ represents the convolution process of the TCN, $s$ is the input sequence, $x$ denotes points within the sequence, $d$ denotes the interval of the dilated convolution, and $i$ corresponds to the number of hidden layers in the TCN. Meanwhile, $x_{s - d \cdot i}$ refers to the sequence points preceding the focal output feature point. Equation (2) denotes the output feature points connected via the residual structure, with $o$ signifying the final output feature point. These points undergo dilated convolution and pass through $i$ hidden layers, capturing richer contextual semantic information. As a result of this process, the final output sequence is $O \in \mathbb{R}^{len \times d}$.
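A minimal PyTorch sketch of a dilated causal convolution block in the spirit of Equations (1) and (2) is given below; the class names, kernel size, and ReLU activation are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution padded only on the left, so the output at time t sees inputs <= t."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                          # x: (batch, channels, len)
        return self.conv(F.pad(x, (self.pad, 0)))

class TCNBlock(nn.Module):
    """Residual block of two dilated causal convolutions: o = Activation(x + F(x))."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        self.conv1 = CausalConv1d(in_ch, out_ch, kernel_size, dilation)
        self.conv2 = CausalConv1d(out_ch, out_ch, kernel_size, dilation)
        self.skip = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        f = self.conv2(torch.relu(self.conv1(x)))
        return torch.relu(self.skip(x) + f)        # length preserved, channels expanded to out_ch
```

Stacking such blocks with dilations 1, 2, 4, ... grows the receptive field exponentially while keeping the output length equal to $len$, which matches the property used by FMM below.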
It is evident that the TCN does not change the sequence length but expands the feature dimension of the signal. The query we aim for consists of a set of feature points, each representing the most distinctive features of the signal. Furthermore, individual fingerprint features are primarily concentrated in transient signal fluctuations, such as the rising and falling edges of radar pulses, rather than in stable emission periods [35]. Based on this principle, we propose the FMM to filter the feature points obtained by the TCN. It selects the top 20% of feature points in $O$ that best capture the distinctive fingerprint features. The selected vectors are then recombined to form new feature vectors, denoted $F_n^{sig}$, which encapsulate the most critical feature information. Finally, position encoding is applied to the newly formed vectors. This process is formalized in Equations (3) and (4).
$$f_{pos} = \max_{d}\big(O \cdot W^{T}\big) \tag{3}$$
$$F_n^{sig} = P_{1d}\big\{\mathrm{topk}_{20\%}(f_{pos})\big\} \tag{4}$$
Here, $O$ is the output of the TCN. We introduce a learnable parameter matrix $W \in \mathbb{R}^{cls \times d}$ to change the feature dimension of $O$ to the number of categories, which is 10 in this article. After multiplication, $O \cdot W^{T} \in \mathbb{R}^{len \times cls}$ represents the classification value of each feature point, and the maximum value along the $cls$ dimension is the classification category of that feature point. The operator $\max_{d}$ takes the maximum value along the feature dimension, so $f_{pos} \in \mathbb{R}^{len}$ collects the maximum feature value of each feature point. This completes the content-based filtering; we then filter again based on position: the top 20% of $f_{pos}$ is taken directly, and a new vector is reconstructed without changing the positional relationships. $P_{1d}$ represents the position encoding. Finally, the self-attention mechanism encodes these points to obtain the final query vector. In summary, this process accomplishes the extraction of signal features and obtains the query vector.
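The following PyTorch sketch illustrates the FMM selection step of Equations (3) and (4); the positional encoding and the subsequent self-attention encoder are omitted, and the function name and tensor shapes are our assumptions.

```python
import torch

def fmm_filter(O, W, keep_ratio=0.2):
    """Filter-by-maximum (FMM) sketch.
    O: (len, d) TCN output; W: (cls, d) learnable projection to class scores.
    Keeps the top `keep_ratio` feature points ranked by their maximum class score,
    preserving their original temporal order."""
    scores = O @ W.T                              # (len, cls) classification value per feature point
    f_pos = scores.max(dim=-1).values             # (len,) maximum over the class dimension, Eq. (3)
    k = max(1, int(keep_ratio * O.shape[0]))
    idx = torch.topk(f_pos, k).indices            # content-based filtering
    idx, _ = torch.sort(idx)                      # position-preserving reconstruction, Eq. (4)
    return O[idx]                                 # (k, d): add 1-D positional encoding, then self-attention
```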

3.2. Image Feature Extraction and Embedding

The wvd image feature embedding process is shown in Figure 4. The wvd images often contain substantial redundant information, necessitating a focus on the most salient feature regions. To this end, we first extract the significant part of the image using edge detection and RGB filtering techniques. This enables more effective identification and analysis of the critical information within the image, thereby enhancing the accuracy and efficiency of image processing. We then divide the salient regions into patches and convert them into a two-dimensional token sequence using an embedding layer and positional encoding. These patches are subsequently encoded using the sparse attention mechanism.
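A minimal patch embedding sketch for the cropped salient region is shown below; the patch size, embedding dimension, and class name are illustrative assumptions, as the paper does not specify these values.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split the cropped salient wvd region into non-overlapping patches and project
    each patch to a token embedding, producing a (n_patches, dim) sequence."""
    def __init__(self, in_ch=3, patch=16, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, img):                        # img: (B, 3, H, W) cropped salient region
        x = self.proj(img)                         # (B, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)        # (B, n_patches, dim); add positional encoding next
```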
The comparison between sparse attention and the original self-attention is shown in Equations (5) and (6). $A_{mqk}$ represents the attention weight; in the original self-attention mechanism, $k$ runs over the full sequence length of the input, while in the sparse attention mechanism $k$ runs over only a small number of associated feature points. Therefore, the sparse attention mechanism calculates only the most relevant feature weights and ignores the unimportant parts, thereby improving computational efficiency. In the context of fingerprint feature generation, this method is expected to accurately identify the high-energy regions in wvd images, which are essential for detecting the beginning and ending points of radar signals. Figure 5 compares the attention weight matrices, with blue highlighting significant features. It is evident that this strategy helps reduce the interference of noise and focuses attention on the most influential factors.
$$\mathrm{Self\text{-}Attn} = \sum_{m=1}^{M} W_m \Big[ \sum_{k} A_{mqk} \cdot W_m' x_k \Big] \tag{5}$$
$$\mathrm{Deform\text{-}Attn} = \sum_{m=1}^{M} W_m \Big[ \sum_{k} A_{mqk} \cdot W_m' \, x\big(p_q + \Delta p_{mqk}\big) \Big] \tag{6}$$
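For concreteness, the following is a simplified single-scale sketch of the deformable (sparse) attention of Equation (6), in the spirit of [15]. The class name, head/point counts, and the direct addition of offsets to normalized reference points are our simplifying assumptions, not the exact implementation used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttention(nn.Module):
    """Simplified single-scale deformable attention: each query attends to only
    n_points sampled locations of the image feature map instead of all H*W tokens."""
    def __init__(self, dim, n_heads=4, n_points=4):
        super().__init__()
        self.n_heads, self.n_points, self.head_dim = n_heads, n_points, dim // n_heads
        self.offset = nn.Linear(dim, n_heads * n_points * 2)   # sampling offsets, Δp_mqk
        self.weight = nn.Linear(dim, n_heads * n_points)       # attention weights, A_mqk
        self.value = nn.Linear(dim, dim)                       # value projection, W'_m x
        self.out = nn.Linear(dim, dim)                         # output projection, W_m

    def forward(self, q, ref, feat):
        # q: (B, Nq, dim) queries; ref: (B, Nq, 2) reference points (x, y) in [-1, 1]
        # feat: (B, dim, H, W) image feature map
        B, Nq, _ = q.shape
        v = self.value(feat.flatten(2).transpose(1, 2))         # (B, H*W, dim)
        v = v.transpose(1, 2).reshape(B * self.n_heads, self.head_dim, *feat.shape[2:])
        offsets = self.offset(q).reshape(B, Nq, self.n_heads, self.n_points, 2)
        attn = self.weight(q).reshape(B, Nq, self.n_heads, self.n_points).softmax(-1)
        loc = (ref[:, :, None, None, :] + offsets).clamp(-1, 1) # p_q + Δp_mqk
        loc = loc.permute(0, 2, 1, 3, 4).reshape(B * self.n_heads, Nq, self.n_points, 2)
        sampled = F.grid_sample(v, loc, align_corners=False)    # (B*heads, head_dim, Nq, n_points)
        attn = attn.permute(0, 2, 1, 3).reshape(B * self.n_heads, 1, Nq, self.n_points)
        out = (sampled * attn).sum(-1)                          # weighted sum over sampled points
        out = out.reshape(B, self.n_heads * self.head_dim, Nq).transpose(1, 2)
        return self.out(out)                                    # (B, Nq, dim)
```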
After extracting time–frequency features from both signal and wvd modalities, the next topic to be discussed is the multimodal feature fusion module.

3.3. Multimodal Features Fusion

Current research on multimodal feature interaction predominantly emphasizes the integration of visual data with semantic information, while comparatively little attention has been directed toward signal processing. Existing methods struggle to achieve precise cross-modal alignment in unpredictable or uncontrolled environments. This section introduces a multimodal feature fusion module based on time–frequency interaction. By mapping the time–frequency features of both the signals and the images into a high-dimensional subspace, and subsequently weighting these high-dimensional features onto the image modality, multimodal feature fusion can be achieved. This approach facilitates a deep interaction of time–frequency features from signals and images, enabling the complementarity of different modal data and enhancing the extraction of fingerprint features.
As shown in Equation (7), $F_n^{sig}$ denotes the query vector obtained by the signal processing module, while $F_n^{img}$ is the image-encoded feature vector.
$$F_n^{sig} = \big[n_1, n_2, n_3, \ldots, n_i\big]^{T}, \qquad F_n^{img} = \big[m_1, m_2, m_3, \ldots, m_j\big] \tag{7}$$
Here, $n_i$ and $m_j$ denote high-dimensional feature vectors partitioned according to position. The cross-attention mechanism is applied to fuse these features. Specifically, $F_n^{img}$ is employed as the key and value, whereas $F_n^{sig}$ serves as the query for decoding $F_n^{img}$. Signal concatenation in the feature dimension and the wvd transformation are two different interaction modes of the time–frequency information, and there are correlations between them. A high-dimensional matrix $S$ can be used to express this correlation; specifically, it represents the relevance weights between the features of the two modalities, which can be applied to the image-encoded vector. After this process, we achieve the weighting of multimodal interaction information in the image domain, thus completing the multimodal fusion. The process is formulated in Equation (8).
$$S = F_n^{sig} \cdot F_n^{img} = \big[n_1, n_2, n_3, \ldots, n_i\big]^{T} \cdot \big[m_1, m_2, m_3, \ldots, m_j\big] = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1j} \\ s_{21} & s_{22} & \cdots & s_{2j} \\ \vdots & \vdots & \ddots & \vdots \\ s_{i1} & s_{i2} & \cdots & s_{ij} \end{bmatrix} \tag{8}$$
Here, $s_{ij} = n_i \cdot m_j$, which represents the weight between position $i$ of the signal vector and position $j$ of the image feature vector. Since $S$ is a positional weight matrix without specific semantic meaning, it must be applied to the image feature vectors that carry the actual semantic information. Accordingly, $S$ is used as a weight matrix acting on the image feature vectors, thereby constructing the multimodal features $P$, as formulated in Equation (9).
$$P = S \cdot F_n^{img} = \begin{bmatrix} p_{(1,:)} \\ p_{(2,:)} \\ \vdots \\ p_{(i,:)} \end{bmatrix} = \begin{bmatrix} s_{(1,:)} \\ s_{(2,:)} \\ \vdots \\ s_{(i,:)} \end{bmatrix} \cdot \begin{bmatrix} m_1 \\ m_2 \\ \vdots \\ m_j \end{bmatrix} \tag{9}$$
The feature vectors in $P$ are denoted by $p_{(i,:)} = s_{(i,:)} \cdot F_n^{img}$. Here, $s_{(i,:)}$ represents the fused feature information at position $i$ of the signal after integrating all the image features, and $p_{(i,:)}$ represents the outcome of weighting the fused information $s_{(i,:)}$ onto position $i$ of the signal. Physically, it is the feature vector at position $i$ of the signal after multimodal fusion. Through this sequential weighting, the fused signal feature $P$ is obtained and subsequently fed into the multilayer perceptron for the final classification.
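A compact PyTorch sketch of this fusion step is given below; it stands in for Equations (7)–(9) using a standard scaled-dot-product cross-attention, where the softmax normalization, residual connection, and layer norm are our assumptions rather than details stated in the paper.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Cross-attention fusion: the signal query F_n^sig decodes the wvd image tokens F_n^img.
    The attention map plays the role of the correlation matrix S, and its weighted sum of
    image values corresponds to the fused features P."""
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, f_sig, f_img):
        # f_sig: (B, i, dim) filtered signal tokens; f_img: (B, j, dim) wvd image tokens
        fused, _ = self.attn(query=f_sig, key=f_img, value=f_img)
        return self.norm(f_sig + fused)            # fused features P, fed to the MLP classification head
```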

4. Results and Discussion

Our experiments consist of four parts: (1) Quantitative analysis experiment. This part compares our algorithm with four baselines and four high-performance methods and comprehensively analyzes its performance through five evaluation indicators. In addition, we use confusion matrices to visualize the challenging categories, further demonstrating the fine-grained identification capability of our approach.
(2) Ablation analysis experiment. Firstly, we verify the impact of FMM on signal feature extraction. Next, we verify the improvement in computational efficiency brought by FMM and the sparse attention mechanism. Then, the accuracy of three multimodal fusion methods is evaluated. Finally, the relationships between the multimodal data and the benefits of our approach are analyzed by testing seven different modality combinations.
(3) Visual analysis experiment. This section visualizes the attention weights of the multimodal module and VIT, demonstrating that the proposed method aligns with the distribution characteristics of individual fingerprint features.
(4) Robustness analysis experiment. This section tests the algorithm's accuracy under low-SNR and imbalanced-sample conditions, validating its potential for real-world applications.

4.1. Experiment Setup

The dataset for this experiment was obtained through fieldwork. All ten radar emitters were produced in the same batch, on the same production line, and are of the same model. The radar model is TXR-30F, the frequency range is 0.9–1.3 GHz, the transmission power is 30 kW, the pulse width is 1–999 μs, and it can transmit square-wave, sine-wave, and other pulse waveforms. In addition, we set the radar emitters to emit the same signal, as shown in Table 1, and the same receiver was employed. No modulation or demodulation techniques were applied at the receiver. Each emitter contributes 1000 short pulse signals. Signal analysis was conducted using the FFT to acquire the time–frequency combination sequence, and the wvd method was employed to generate the images. The dataset was then randomly split into training and testing sets with a ratio of 6:4.
In the experiments, we use several metrics to evaluate the performance of the different methods: accuracy, precision, recall, F1-score, and the precision–recall (PR) curve. In classification tasks, accuracy is the proportion of samples correctly predicted by the model among the total number of samples, representing the model's overall ability to classify correctly. Precision measures the proportion of samples that are actually positive among those predicted as positive by the model; a high precision means that the model rarely makes wrong predictions among the samples it labels positive. Recall indicates how many of the samples that are actually positive are correctly identified by the model, reflecting the model's sensitivity to positive samples. F1-score jointly considers precision and recall and is used to evaluate the overall classification performance. The PR curve is obtained by calculating the average precision and recall at different thresholds and reflects the relationship between precision and recall.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F1\text{-}\mathrm{Score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where $TP$ is the number of samples that the model correctly predicts as the positive category, and $FP$ is the number of samples that the model incorrectly predicts as the positive category. Similarly, $TN$ is the number of samples that the model correctly predicts as the negative category, and $FN$ is the number of samples that the model incorrectly predicts as the negative category, i.e., samples that actually belong to the positive category but are predicted as negative by the model.
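For the ten-class setting, these binary definitions are applied per class and macro-averaged. A minimal NumPy sketch (the function name and the small epsilon guard are our own) is:

```python
import numpy as np

def macro_metrics(y_true, y_pred, n_classes=10, eps=1e-12):
    """Accuracy plus macro-averaged precision, recall, and F1-score
    computed from per-class TP/FP/FN counts."""
    precisions, recalls = [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precisions.append(tp / (tp + fp + eps))
        recalls.append(tp / (tp + fn + eps))
    precision, recall = float(np.mean(precisions)), float(np.mean(recalls))
    accuracy = float(np.mean(y_true == y_pred))
    f1 = 2 * precision * recall / (precision + recall + eps)
    return accuracy, precision, recall, f1
```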

4.2. Quantitative Analysis

Our algorithm is compared against eight SEI methods. These include four classic baselines: LSTM, Transformer, Resnet50, and VIT. In addition, we also compare with four high-performance methods: Xu's method using signal envelopes and deep belief networks (DBNs) [29], BRT features processed by denoising autoencoders (BRT + DAE) [8], the SST + SAN identification approach [9], and the direct application of signal waveforms as images fed into VIT [36]. Detailed comparative results can be found in Table 2. Our method demonstrates superior performance across all four evaluation indicators: precision, recall, F1-score, and accuracy. Specifically, our results show an improvement of 8.2% in precision, 10% in recall, 9% in F1-score, and 9% in accuracy compared to the second-best method.
Figure 6 depicts the PR curves of the nine methods. It is evident from the figure that our method outperforms all other methods, demonstrating the best performance.
In Figure 7, confusion matrices compare our proposed SEI method with the four baseline approaches. Among the four baseline algorithms, the categories that most significantly impact accuracy are the first, eighth, ninth, and tenth. This is mainly due to the high similarity of these signals, which prevents existing methodologies from accurately isolating distinct features. In contrast, our algorithm utilizes a multimodal approach, allowing effective feature fusion across different dimensions. As a result, our method has stronger fine-grained identification capability.

4.3. Ablation Experiment

In this section, we first verify the effects of FMM and sparse attention mechanism. Then, we test the performance of the various combinations of modal data. Finally, we compare three multimodal fusion methods and demonstrate the performance of the cross-attention mechanism.
In order to verify the performance of the combination of FMM and the TCN, a comparison is made using different signal feature extraction networks, including TCN, CNN, LSTM, CNN-LSTM, and RNN. Table 3 presents the comparison results. The data show that when using FMM, the TCN achieves the highest accuracy, which is 7.6% higher than the CNN-based approach. As analyzed above, the causal dilated convolution in the TCN ensures the causality of the radar signal and a large receptive field for each output feature point; when combined with FMM, it generates better identification results. On the other hand, Table 3 also shows that the accuracy of each method remains relatively consistent with and without FMM, since FMM only selects the most significant feature points rather than enhancing the feature extraction ability of the network. However, the FMM filtering mechanism does improve the efficiency of the correlation calculation. Table 4 shows the results. It can be observed that FMM reduces the FLOPs of the network by 0.83 G and increases the FPS by 27 without increasing the number of parameters. Meanwhile, the sparse attention mechanism also reduces the FLOPs by 0.08 G, although the FPS declines to 36. Overall, FMM with the sparse attention mechanism improves the computational efficiency by 46.15%.
The above experiments prove the positive effect of FMM and the sparse attention mechanism. We also conducted a visualization experiment on FMM to further demonstrate its key role in selecting feature points. Because we divide the salient area of the wvd image into 20 × 9 image blocks, we can assess its effect on feature filtering by statistically analyzing the distribution of attention-weight maxima in the multimodal fusion. Figure 8 shows the results: Figure 8a shows the distribution of the maxima feature points with FMM, while Figure 8b presents the comparison without FMM. It is evident that FMM makes the significant feature points more concentrated, which results in higher classification confidence and faster fitting during model training.
In order to verify the performance of the multimodal feature fusion module, we compared three different methods. Zhang [32] designed a dual-branch network, where one branch extracts signal features and the other extracts image features; the features of the two modalities are then directly concatenated for classification. Another method is to add a self-attention mechanism before the classification head, allowing the concatenated features to interact further in the feature space before classification. As can be seen from Figure 9, the cross-attention is more effective: compared with the bilinear method and self-attention, its identification accuracy is improved by 9.1% and 5.8%, respectively.
Finally, we tested the performance of different modality data combinations. We examined six ablated combinations separately: the single time sequence, the single frequency sequence, the wvd image, the fusion of the time–frequency sequence, the combination of the time sequence with the wvd image, and the integration of the frequency sequence with the wvd image. The accuracy curves are shown in Figure 10.
In single-modal feature classification, the single time sequence achieves an accuracy of 73.2%. However, the frequency sequence exhibits a lower performance of 59.1%. This discrepancy can be attributed to the dataset consisting of single-frequency signals whose spectra have sharp peaks and concentrated energy, masking the fingerprint features. Combining the time and frequency sequences, however, shows a substantial performance enhancement compared to using solely the single-modal time sequence, achieving an accuracy of 85.4%. This improvement demonstrates that even the simple fusion of the time and frequency sequences in the feature dimension can improve classification performance. It further confirms that, despite originating from the same data source, features from different transformations are complementary, leading to improved overall identification accuracy. Furthermore, when combining the time or frequency sequence with the wvd image, a slight decrease in accuracy is observed in contrast to the simple fusion of the time and frequency sequences. This decline can be attributed to the misalignment of the signal and image features: since wvd images are generated from the time–frequency sequences, the absence of either sequence in the signal domain complicates feature alignment, ultimately reducing recognition performance. As a result, the identification accuracies are 84.2% and 82.8%, respectively.

4.4. Visual Analysis

In this section, to demonstrate that our proposed method effectively aligns multimodal features and is consistent with the mechanism of radar individual fingerprint generation, we visualize the attention weights of our method and VIT, as shown in Figure 11. Comparing the third and fourth columns, it is evident that the attention distribution of our method is sparse and accurate. Figure 8 already showed that the maxima feature points are predominantly concentrated at both ends of the wvd's significant region, and Figure 11 shows that the maxima feature points are distributed at the same positions. This is consistent with the mechanism of individual fingerprint generation [35]. In contrast, the attention distribution of VIT is scattered and its focus is not clear.
Our method employs a sparse attention mechanism in the image encoding module and FMM in the signal encoding module; both improve the computational efficiency of attention. In addition, the multimodal module achieves the alignment of multimodal features, encouraging attention to focus on the aligned parts. In contrast, VIT only utilizes a self-attention mechanism, resulting in considerable redundancy in the attention feature map, with the key areas not prominent.

4.5. Robustness Analysis

Considering actual application scenarios, the number of samples in different categories is not the same, and the radar signal is interfered with by complex factors. Therefore, in this section, we analyze the robustness of our algorithm under two conditions: noise interference and sample imbalance.
We add Gaussian white noise at intervals of 5 dB across a range from −10 dB to 20 dB to both the training and testing data while maintaining a consistent SNR between the two parts. The results are depicted in Figure 12. The noise-resistance performance clearly shows that our algorithm outperforms the other methods, demonstrating superior noise-resistance capability. In addition, our method achieves an accuracy of over 80% even at an SNR of 5 dB, indicating its ability to perform well under complex interference environments. This adaptability makes our algorithm more suitable for real-world scenarios.
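A minimal sketch of adding white Gaussian noise at a target SNR is shown below; the function name and the complex/real handling are our assumptions.

```python
import numpy as np

def add_awgn(x, snr_db, rng=None):
    """Add white Gaussian noise to a real or complex pulse at a target SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    p_signal = np.mean(np.abs(x) ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))
    if np.iscomplexobj(x):
        noise = np.sqrt(p_noise / 2) * (rng.standard_normal(x.shape) + 1j * rng.standard_normal(x.shape))
    else:
        noise = np.sqrt(p_noise) * rng.standard_normal(x.shape)
    return x + noise
```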
Similarly, in electronic communication scenarios, radar signal capture can be challenging, leading to unevenly distributed data samples across categories. This imbalance can impact the effectiveness of identification performance. To validate our proposed method under realistic conditions, we modified the dataset by reducing training data: 60% for categories 1–3, 40% for categories 4–6, and 20% for categories 7–9. The training set for category 10 remained unchanged, while the test set sizes were unaltered. This resulted in a new imbalanced dataset, allowing us to compare the performance of our method against existing approaches. The accuracy curve is illustrated in Figure 13. It is evident that the accuracy of all SEI methods declines to some degree. This decline occurs because the neural network, when extracting features from unbalanced training samples, cannot fully capture the fingerprint information of all emitters, leading to limitations in identification. However, our algorithm outperforms other methods due to its multimodal feature fusion. Even when some categories have fewer samples, it excels at extracting significant features, with an accuracy exceeding 80%.

5. Conclusions

In this paper, we designed a novel time–frequency multimodal feature fusion network for SEI tasks. It includes a time–frequency signal encoding module, a wvd image feature encoding module, and a multimodal feature fusion module, which together address the challenges found under single-modal conditions. The algorithm offers an end-to-end approach, requiring only the original time sequence. The time–frequency signal encoding module consists of the TCN, FMM, and a self-attention mechanism, which process the signal data and generate the query vector. The wvd image feature encoding module then splits the image into blocks and encodes them using a sparse attention mechanism. Finally, the multimodal feature fusion module uses the query vector to decode the image features, allowing feature interaction in the high-dimensional subspace. In addition, during the design process, attention was given to the generation mechanism of radar individual fingerprint features, and FMM was proposed to select the maximum feature points. Quantitative analysis and ablation experiments prove that our method has superior identification performance. The visual experiment further shows that our method effectively aligns multimodal features and is consistent with the generation mechanism of fingerprint features. Moreover, the robustness analysis shows that our method still performs well under sample imbalance and low SNR.
Our method can be applied to the demodulated radar intermediate frequency signal. Future work will focus on improving our algorithm’s performance and exploring its mechanism further. Additionally, the research will concentrate on open-set identification of unknown radiation sources in electronic warfare scenarios, addressing challenges such as small sample sizes, more individual categories, and intentional interference environments.

Author Contributions

Conceptualization, Y.H. and K.W.; methodology, Y.H.; validation, Q.S.; formal analysis, H.L.; investigation, Y.H.; data curation, Y.H.; writing—original draft preparation, Y.H.; writing—review and editing, K.W.; visualization, Q.S.; project administration, B.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset can be obtained by contacting the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Talbot, K.I.; Duley, P.R.; Hyatt, M.H. Specific emitter identification and verification. Technol. Rev. 2003, 113, 113–130. [Google Scholar]
  2. Liu, M.W.; Doherty, J.F. Nonlinearity estimation for specific emitter identification in multipath channels. IEEE Trans. Inf. Forensics Secur. 2011, 6, 1076–1085. [Google Scholar]
  3. Carroll, T. A nonlinear dynamics method for signal identification. Chaos Interdiscip. J. Nonlinear Sci. 2007, 17, 023109. [Google Scholar] [CrossRef] [PubMed]
  4. Wiley, R. ELINT: The Interception and Analysis of Radar Signals; Artech: London, UK, 2006. [Google Scholar]
  5. DeYoung, D.; Dahlburg, J.; Bevilacqua, R.; Borsuk, G.; Boris, J.; Chang, S.; Colton, R.; Eisenhauer, R.; Eppert, H.; Franchi, E. Fulfilling the Roosevelts’ Vision for American Naval Power (1923–2005). AGRIS 2006, 17, 73. [Google Scholar]
  6. Xu, D. Research on Mechanism and Methodology of Specific Emitter Identification. Ph.D. Thesis, National University of Defense Technology, Changsha, China, 2008. [Google Scholar]
  7. Liu, M.; Chai, Y.; Li, M.; Wang, J.; Zhao, N. Transfer Learning-Based Specific Emitter Identification for ADS-B over Satellite System. Remote Sens. 2024, 16, 2068. [Google Scholar] [CrossRef]
  8. Zhou, Y.; Wang, X.; Chen, Y.; Tian, Y. Specific emitter identification via bispectrum-radon transform and hybrid deep model. Math. Probl. Eng. 2020, 2020, 7646527. [Google Scholar] [CrossRef]
  9. Zhu, M.; Feng, Z.; Zhou, X.; Xiao, R.; Qi, Y.; Zhang, X. Specific emitter identification based on synchrosqueezing transform for civil radar. Electronics 2020, 9, 658. [Google Scholar] [CrossRef]
  10. Yuan, S.; Li, P.; Wu, B. Radar Emitter Signal Intra-Pulse Modulation Open Set Recognition Based on Deep Neural Network. Remote Sens. 2023, 16, 108. [Google Scholar] [CrossRef]
  11. Zhu, Z.; Ji, H.; Li, L. Deep multimodal subspace interactive mutual network for specific emitter identification. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 4289–4300. [Google Scholar] [CrossRef]
  12. Tian, T.; Zhang, Q.; Zhang, Z.; Niu, F.; Guo, X.; Zhou, F. Shipborne multi-function radar working mode recognition based on DP-ATCN. Remote Sens. 2023, 15, 3415. [Google Scholar] [CrossRef]
  13. Wang, Y.; Zhang, W.; Chen, W.; Chen, C.; Liang, Z. MFFnet: Multimodal Feature Fusion Network for Synthetic Aperture Radar and Optical Image Land Cover Classification. Remote Sens. 2024, 16, 2459. [Google Scholar] [CrossRef]
  14. Lea, C.; Vidal, R.; Reiter, A.; Hager, G.D. Temporal convolutional networks: A unified approach to action segmentation. In Computer Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8–10 and 15–16, 2016, Proceedings, Part III 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 47–54. [Google Scholar]
  15. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  16. Ding, L.; Wang, S.; Wang, F.; Zhang, W. Specific emitter identification via convolutional neural networks. IEEE Commun. Lett. 2018, 22, 2591–2594. [Google Scholar] [CrossRef]
  17. Langley, L.E. Specific emitter identification (SEI) and classical parameter fusion technology. In Proceedings of the Proceedings of WESCON’93, San Francisco, CA, USA, 28–30 September 1993; pp. 377–381. [Google Scholar]
  18. Jiang, P. Subtle Characteristic Analysis and Recognition of Radar Signals. Ph.D. Thesis, Harbin Engineering University, Harbin, China, 2012. [Google Scholar]
  19. Chen, P.-B.; Li, G. Applying Dynamic Time Warping Algorithm to Specific Radar Emitter Identification. J. Signal Process. 2015, 31, 1035–1040. [Google Scholar]
  20. Zhang, Z.; Chang, J.; Chai, M.; Tang, N. Specific emitter identification based on power amplifier. Int. J. Perform. Eng. 2019, 15, 1005. [Google Scholar] [CrossRef]
  21. Pan, J.; Guo, L.; Chen, Q.; Zhang, S.; Xiong, J. Specific radar emitter identification using 1D-CBAM-ResNet. In Proceedings of the 2022 14th International Conference on Wireless Communications and Signal Processing (WCSP), Nanjing, China, 1–3 November 2022; pp. 483–488. [Google Scholar]
  22. Shan, S.; Kan, M.; Liu, X.; Liu, M.; Wu, S. Deep learning: The revival and transformation of multilayer neural networks. Sci. Technol. Rev. 2016, 34, 60–70. [Google Scholar]
  23. Mou, F.; Fan, Z.; Jiang, C.; Zhang, Y.; Wang, L.; Li, X. Double Augmentation: A Modal Transforming Method for Ship Detection in Remote Sensing Imagery. Remote Sens. 2024, 16, 600. [Google Scholar] [CrossRef]
  24. Cheng, S.; Dong, X. Radar specific emitter identification based on DBN feature extraction. J. Air Force Eng. Univ. Nat. Sci. Ed. 2020, 20, 91–96. [Google Scholar]
  25. Wang, B.; Xie, J.; Wang, F. Specific Emitter Identification Based on ACGAN and STFT. In Proceedings of the 2024 7th International Conference on Advanced Algorithms and Control Engineering (ICAACE), Shanghai, China, 1–3 March 2024; pp. 400–403. [Google Scholar]
  26. Dong, W.; Wang, Y.; Sun, G.; Xing, M. A Specific Emitter Identification Method Based on Time-Frequency Feature Extraction. In Proceedings of the IGARSS 2023—2023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA, 16–21 July 2023; pp. 6302–6305. [Google Scholar]
  27. Ru, X.; Huang, Z.; Liu, Z.; Jiang, W. Frequency-domain distribution and band-width of unintentional modulation on pulse. Electron. Lett. 2016, 52, 1853–1855. [Google Scholar] [CrossRef]
  28. Ru, X.H.; Liu, Z.; Huang, Z.T.; Jiang, W.L. Evaluation of unintentional modulation for pulse compression signals based on spectrum asymmetry. IET Radar Sonar Navig. 2017, 11, 656–663. [Google Scholar] [CrossRef]
  29. Wang, X.; Huang, G.; Zhou, Z.; Tian, W.; Yao, J.; Gao, J. Radar emitter recognition based on the energy cumulant of short time Fourier transform and reinforced deep belief network. Sensors 2018, 18, 3103. [Google Scholar] [CrossRef] [PubMed]
  30. Gok, G.; Alp, Y.K.; Arikan, O. A new method for specific emitter identification with results on real radar measurements. IEEE Trans. Inf. Forensics Secur. 2020, 15, 3335–3346. [Google Scholar] [CrossRef]
  31. Zhou, Z.W.; Huang, G.M.; Gao, J.; Man, X. Radar emitter identification algorithm based on deep learning. J. Xidian Univ. 2017, 44, 85–90. [Google Scholar]
  32. Zhang, S.; Chen, S.; Chen, X.; Liu, Y.; Wang, W. Active deception jamming recognition method in multimodal radar based on small samples. J. Radar 2023, 12, 882–891. [Google Scholar]
  33. Qian, Y.; Qi, J.; Kuai, X.; Han, G.; Sun, H.; Hong, S. Specific emitter identification based on multi-level sparse representation in automatic identification system. IEEE Trans. Inf. Forensics Secur. 2021, 16, 2872–2884. [Google Scholar] [CrossRef]
  34. Liao, Y.; Li, H.; Cao, Y.; Liu, Z.; Wang, W.; Liu, X. Fast Fourier Transform with Multi-head Attention for Specific Emitter Identification. IEEE Trans. Instrum. Meas. 2023, 73, 1–12. [Google Scholar] [CrossRef]
  35. Guo, P. Research on Radar Radiation Source Individual Recognition Technology Based on Deep Learning. Master’s Thesis, Xidian University, Xi’an, China, 2022. [Google Scholar]
  36. Zhang, M.; Diao, M.; Gao, L.; Liu, L. Neural Networks for Radar Waveform Recognition. Symmetry 2017, 9, 75. [Google Scholar] [CrossRef]
Figure 1. The illustration showcases the temporal waveform, spectrum, and salient portions extracted from wvd images. The first row of wvd represents categories 1 through 5, while the second row signifies categories 6 through 10. The red circles indicate where the individual fingerprints exist.
Figure 2. Overview of our algorithm. The network has two branches for feature extraction and encoding on signals and images. It uses the multimodal information fusion module to align these features, which then support the final classification.
Figure 3. The time–frequency signal encoding module. After the fusion of time–frequency signals, features are extracted through TCN, and then FMM is used to filter features. Finally the self-attention mechanism is used for feature encoding.
Figure 4. Structured information extraction module of wvd images.
Figure 5. Comparison of attention matrices between Transformer and Deformable Transformer.
Figure 6. The precision–recall curves for the various methodologies.
Figure 7. Comparative confusion matrices illustrating the performance of five methods.
Figure 8. The distribution statistics of the maximum feature points of 400 samples in the first category.
Figure 9. Comparative evaluation of three multimodal feature fusion strategies.
Figure 10. Ablation experimental results of different modal array combinations.
Figure 11. Visual comparison analysis of categories (1–10). The first column represents the time waveform, the second depicts the wvd image, the third exhibits the attention visualization of our method, and the fourth represents the attention visualization of VIT.
Figure 12. Identification performance under different SNR.
Figure 13. Identification performance under unbalanced samples.
Table 1. The introduction of the signal pulse parameters of our dataset.
Variables | PRI | IF | PW | Modulation
Pulse parameters | 120 μs | 1.2 GHz | 24 μs | No modulation
Table 2. Evaluation metrics for the proposed method and the eight comparison methods.
Method | Precision | Recall | F1-Score | Accuracy
LSTM | 65.7% | 61.5% | 60.7% | 61.6%
Resnet50 | 85.6% | 83.5% | 84.4% | 84.6%
Transformer | 81.6% | 79.5% | 77.5% | 79.0%
VIT | 63.8% | 63.7% | 58.6% | 63.7%
Envelope + DBN | 72.2% | 70.7% | 69.4% | 79.0%
BRT + DAE | 74.8% | 74.4% | 74.3% | 74.4%
SST + SAN | 80.2% | 78.9% | 79.1% | 78.9%
Waveform + VIT | 85.7% | 84.3% | 84.6% | 84.3%
TFMFIN | 93.9% | 93.5% | 93.6% | 93.6%
Table 3. Comparison of different signal feature extraction methods with and without FMM.
Method | CNN | LSTM | CNN-LSTM | RNN | TCN
Accuracy (with FMM) | 66.3% | 70.7% | 72.9% | 68.5% | 73.9%
Accuracy (without FMM) | 65.2% | 69.2% | 73.2% | 69.4% | 71.0%
Table 4. The impact of FMM and sparse attention mechanism on computation efficiency.
FMM | Sparse-Attn | FLOPs | Parameters | Inference FPS
✓ | ✓ | 1.05 G | 1.35 M | 36
✗ | ✓ | 1.88 G | 1.35 M | 9
✓ | ✗ | 1.13 G | 1.36 M | 40
✗ | ✗ | 1.95 G | 1.35 M | 12