**3. Deep Learning Framework for Signal Detection and Modulation Recognition**

DL networks aim to learn hierarchies of features from data. As one branch of DL techniques, CNNs perform well in the field of image recognition. A CNN performs feature learning via non-linear transformations implemented as a series of nested layers, each consisting of several kernels that perform a convolution over the input. The kernels are usually multidimensional arrays that can be updated by learning algorithms [39]. Our DL framework achieves multi-signal detection and modulation recognition, using a different deep neural network for each task: SSD networks for signal detection and a multi-input CNN, designed by us, for modulation recognition.

### *3.1. SSD Networks for Signal Detection*

We use SSD networks to achieve multi-signal detection. Existing DL target detection algorithms fall into two main kinds: algorithms based on region recommendation (two-stage methods) and algorithms based on regression (one-stage methods). Regression-based algorithms include the YOLO series [40–43] and the SSD series [43,44], while region recommendation-based algorithms include RCNN [45], Fast RCNN [46], and Faster RCNN [47]. Since regression-based algorithms are faster than region recommendation-based ones, we use SSD networks as our signal detection model. SSD networks generate a series of fixed-size borders together with the probability that each border contains a target; the final detection and recognition results are then computed by the non-maximum suppression algorithm [48]. The structure of SSD networks is shown in Figure 5, and it can be divided into four parts.

**Figure 5.** The structure of SSD networks.

**Part 1:** The networks for feature extraction. The initial part of the SSD networks consists of the first layers of the VGG16 network, which serves as the primary network to extract deep features of the whole input image. Behind the primary network is a pyramid structure, a series of simple convolution layers that make the feature maps progressively smaller. With this pyramid structure, we obtain several feature maps at different scales.

**Part 2:** The design of the default box. In this part, we design default boxes for feature maps of different scales. Each feature map at the top of the VGG16 networks is associated with a set of default boxes. As shown in Figure 6, dotted borders appear at each position of the 4 × 4 and 8 × 8 feature maps. These fixed-size borders are the default boxes, and their scale parameters are determined by the scale of the corresponding feature map. For example, assuming that we need *m* feature maps for prediction, the scale parameters of the default boxes are as follows:

$$S_k = S_{\min} + \frac{S_{\max} - S_{\min}}{m - 1}(k - 1), \quad k \in [1, m] \tag{22}$$

where $S_{\min}$ is the bottom scale and $S_{\max}$ is the top scale. The aspect ratio of the default boxes can be expressed as $a_r \in \{1, 2, 3, 1/2, 1/3\}$, so the default box width is $W_k^a = S_k\sqrt{a_r}$ and the height is $H_k^a = S_k/\sqrt{a_r}$.
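The scale and aspect-ratio rules above can be sketched as follows. The helper names `default_box_scales` and `default_box_dims` are illustrative, and the values $S_{\min}=0.2$, $S_{\max}=0.9$ used below are the common SSD defaults, not necessarily the parameters used in our experiments.

```python
import math

def default_box_scales(s_min, s_max, m):
    """Per-feature-map scales S_k, evenly spaced as in Eq. (22)."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

def default_box_dims(s_k, aspect_ratios=(1, 2, 3, 1/2, 1/3)):
    """(width, height) of each default box at scale s_k:
    W = S_k * sqrt(a_r), H = S_k / sqrt(a_r), so the area is S_k^2 for every a_r."""
    return [(s_k * math.sqrt(ar), s_k / math.sqrt(ar)) for ar in aspect_ratios]
```

Note that every box at a given scale has the same area $S_k^2$; the aspect ratio only redistributes it between width and height.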

**Figure 6.** The Design of the default box.

**Part 3:** Detection and recognition. In this part, we predict the target category and location. We add a set of convolution kernels behind several feature maps of different scales; using these kernels, we obtain a fixed set of detection results. For an *m* × *n* × *p* feature map, a small convolution kernel of size 3 × 3 × *p* is used as the fundamental prediction element. Finally, the classification probability and the location offsets of each default box are obtained.

**Part 4:** Non-maximum suppression. In the last part, we use non-maximum suppression to select the best prediction results. For the default boxes matched by each real target border, we calculate their intersection-over-union (IoU) ratios. The expression is shown as follows:

$$\text{IoU} = (A \cap B) / (A \cup B) \tag{23}$$

where A and B are two borders. We select the default boxes whose *IoU* is greater than 0.5 as candidate results, and then obtain the default box with the highest confidence degree by non-maximum suppression.
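A minimal NumPy sketch of Eq. (23) and greedy non-maximum suppression may clarify the selection step; the corner-format boxes and the `iou`/`nms` helper names are illustrative, not the paper's implementation.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2] corners, per Eq. (23)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))   # intersection width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))   # intersection height
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-score box and drop
    remaining boxes whose IoU with it exceeds thresh."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order):
        i = order[0]
        keep.append(i)
        order = np.array([j for j in order[1:] if iou(boxes[i], boxes[j]) <= thresh])
    return keep
```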

In the offline training stage, the overall objective function of the SSD networks includes two parts, confidence loss and location loss. The expression is shown as follows:

$$L(x, c, l, g) = \frac{1}{N}\big(L_{\text{conf}}(x, c) + \alpha L_{\text{loc}}(x, l, g)\big) \tag{24}$$

where *x* indicates whether a default box is a target or not, and *N* is the number of default boxes matched to real target borders. The parameter α adjusts the ratio between $L_{\text{conf}}$ and $L_{\text{loc}}$; by default, α = 1. $L_{\text{conf}}$ is the softmax loss function. $L_{\text{loc}}$ measures the performance of the boundary box prediction, and in our initial research we use the typical $\text{smooth}_{L1}$ function to calculate it:

$$L_{\text{conf}}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_{i}^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_{i}^{0}\right), \quad \hat{c}_{i}^{p} = \frac{\exp\left(c_{i}^{p}\right)}{\sum_{p} \exp\left(c_{i}^{p}\right)} \tag{25}$$

$$L_{\text{loc}}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \text{smooth}_{L1}\left(l_{i}^{m} - \hat{g}_{j}^{m}\right) \tag{26}$$

where *Pos* and *Neg* represent all positive and negative borders, respectively. $c_i^p$ represents the confidence degree that the $i$th default box matches a target of class $p$. $l_i^m$ represents the predicted offset of the $i$th default box, where $(cx, cy)$ are the box center coordinates and $(w, h)$ are the box width and height. $\hat{g}_j^m$ represents the deviation between the real target border $g_j^m$ and the default box $d_i^m$, calculated as follows: $\hat{g}_j^{cx} = (g_j^{cx} - d_i^{cx})/d_i^{w}$, $\hat{g}_j^{cy} = (g_j^{cy} - d_i^{cy})/d_i^{h}$, $\hat{g}_j^{w} = \log(g_j^{w}/d_i^{w})$, $\hat{g}_j^{h} = \log(g_j^{h}/d_i^{h})$.
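The offset encoding above, and its inverse used at prediction time, can be sketched directly; `encode`/`decode` are hypothetical helper names, with boxes given in center form (cx, cy, w, h).

```python
import math

def encode(g, d):
    """Offsets g_hat of a ground-truth box g relative to a default box d,
    following the equations after Eq. (26). Boxes are (cx, cy, w, h)."""
    return ((g[0] - d[0]) / d[2],      # center-x shift, scaled by default width
            (g[1] - d[1]) / d[3],      # center-y shift, scaled by default height
            math.log(g[2] / d[2]),     # log width ratio
            math.log(g[3] / d[3]))     # log height ratio

def decode(ghat, d):
    """Inverse mapping: recover the box from predicted offsets."""
    return (d[0] + ghat[0] * d[2],
            d[1] + ghat[1] * d[3],
            d[2] * math.exp(ghat[2]),
            d[3] * math.exp(ghat[3]))
```

A perfectly matched box encodes to all-zero offsets, which is why the network can start from small predictions around each default box.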

### *3.2. Multi-Inputs CNNs for Modulation Recognition*

For the signal modulation recognition task, the modulation set is {BPSK, QPSK, OQPSK, 8PSK, 16QAM, 16APSK, 32APSK, 64QAM}. These are all amplitude-phase modulations and cannot be distinguished from one another by their time-frequency characteristics in the SSD network. Hence, we use the eye diagram and vector diagram as the model inputs. The multi-input CNN model is shown in Figure 7. The initial size of the samples is 128 × 128; we use softmax as the output layer's activation function and ReLU as the activation function of all other layers.

The signal feature extraction can be divided into three stages. In the first stage, we convolve the IQ eye diagrams and the vector diagram with 7 × 7 convolution kernels, respectively. To unify the dynamic range of the feature maps, we apply batch normalization (BN) [49] to the first-layer outputs, then perform max pooling on the BN feature maps. Finally, we concatenate the feature maps coming from the IQ eye diagram inputs.

**Figure 7.** The network model for signal modulation recognition.

In the second feature extraction stage, we adopt the residual network structure to avoid the degradation caused by excessive network depth. The basic structure of ResNet [50] is shown in Figure 8. After the second stage, the feature maps from each input are concatenated. After batch normalization in the third stage, we directly apply global max pooling to the feature maps to reduce the number of network parameters.
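The key idea of the residual structure in Figure 8 is the identity shortcut: the block learns a residual F(x) and outputs the activation of F(x) + x. The following is a minimal sketch with dense transforms standing in for the block's convolution layers (the `residual_block` name and the use of matrix products are illustrative only).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Minimal residual unit: two learned transforms plus the identity
    shortcut, so the block outputs relu(F(x) + x) rather than relu(F(x))."""
    f = w2 @ relu(w1 @ x)   # stand-in for the two conv layers of Figure 8
    return relu(f + x)      # identity shortcut added before the final ReLU
```

When the learned residual is zero, the block reduces to the identity (for non-negative inputs), which is what makes very deep stacks trainable without degradation.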

**Figure 8.** The basic structure of ResNet.

For network optimization, we adopt the Adam algorithm [51] to solve for the optimal network parameters. The categorical cross-entropy error with an L2 regularization term is chosen as the loss function, represented as:

$$L(\mathbf{w}, \mathbf{b}; \mathbf{x}\_1, \mathbf{x}\_2, \mathbf{x}\_3, \mathbf{y}) = -\sum\_{i}^{N} \mathbf{y}\_i^{T} \log\big(f\_1(\mathbf{x}\_{1,i}, \mathbf{x}\_{2,i}, \mathbf{x}\_{3,i}; \mathbf{w}, \mathbf{b})\big) + \lambda\_1 \sum \left\| \mathbf{w} \right\|^2 \tag{27}$$
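Eq. (27) can be computed as below; this is a generic NumPy sketch of categorical cross-entropy with an L2 weight penalty, with the `loss` signature and the small epsilon inside the log chosen for illustration, not taken from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def loss(logits, y_onehot, weights, lam=1e-4):
    """Categorical cross-entropy plus lambda * sum ||w||^2, as in Eq. (27)."""
    p = softmax(logits)
    ce = -np.sum(y_onehot * np.log(p + 1e-12))          # cross-entropy term
    l2 = lam * sum(np.sum(w ** 2) for w in weights)     # L2 penalty over weights
    return ce + l2
```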

### *3.3. The Description for Deep Learning Framework*

Based on the above introduction of the signal detection network and the modulation recognition network, we now describe how our DL framework is used. Figure 9 presents the system model. The steps are as follows:

**Figure 9.** System model.

**Step 1:** We construct the signal detection network and modulation recognition network and train each network with their appropriate samples.

**Step 2:** In the online testing phase, we perform the STFT on the wideband signal and use the trained SSD networks to detect signals in the time-frequency spectrum. In this step, we obtain the center frequency and start-stop time of each signal; for MFSK signals, we also obtain the modulation format.

**Step 3:** For amplitude-phase modulated signals, we obtain the central frequency and start-stop time in Step 2. With this knowledge, we filter the target signal and estimate the symbol rate from the envelope spectrum. Then we down-convert the signal and apply a matched filter using the estimated symbol rate.

**Step 4:** If a timing deviation exists in the target signal, it is necessary to extract the sample values at the optimum sampling positions for the eye diagram and vector diagram. We use the non-data-aided timing estimation algorithm in [52]. The specific expression is as follows, where $L_0$ is the length of the signal symbols, $T$ is the sampling period, and $N$ is the oversampling factor:

$$\hat{\tau} = \arg\left\{ \sum\_{k=0}^{NL\_0 - 1} \left| s\left(\frac{kT}{N}\right) \right|^2 e^{-j2\pi k/N} \right\} \tag{28}$$
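A NumPy sketch of this estimator follows. Assuming [52] takes the Oerder–Meyr square-law form, we additionally divide the angle by $-2\pi$ so the result is the timing offset as a fraction of the symbol period; that normalization goes beyond the bare $\arg\{\cdot\}$ of Eq. (28) and is our assumption.

```python
import numpy as np

def estimate_timing(s, n_ovs):
    """Non-data-aided timing estimate from the squared envelope, per Eq. (28).

    s     : complex baseband samples, n_ovs samples per symbol
    n_ovs : oversampling factor N
    Returns the timing offset as a fraction of the symbol period
    (the -1/(2*pi) normalization is assumed, as in Oerder-Meyr).
    """
    k = np.arange(len(s))
    # correlate the squared envelope against a tone at the symbol rate
    x = np.sum(np.abs(s) ** 2 * np.exp(-2j * np.pi * k / n_ovs))
    return -np.angle(x) / (2 * np.pi)
```

The squared envelope of a pulse-shaped signal contains a spectral line at the symbol rate whose phase encodes the timing offset, which is what the single complex sum extracts.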

**Step 5:** We adjust the target signal's sampling rate and obtain the baseband signal with a maximum delay of 32 sampling periods. We then generate the eye diagram and vector diagram from the processed signal.

**Step 6:** We use the trained modulation recognition network to identify the signal from its eye diagram and vector diagram. This completes the signal detection and modulation recognition.

### **4. Results**

In Sections 2 and 3, we discussed the methods that convert complex signal samples into images without noticeable information loss, and introduced the structure of our DL framework. Table 1 shows the time complexity of our DL framework at each processing stage, where N is the number of signals in the wideband range. Thanks to the evolution of GPUs, our framework has low time complexity, which is acceptable for many practical communication systems.

**Table 1.** The time complexity of the DL framework (ms).


We have also conducted several experiments to demonstrate the performance of the DL framework for joint signal detection and modulation recognition in wireless communication systems. Our experiments can be divided into two parts: (1) performance on multi-signal detection and (2) performance on modulation recognition. The rest of this section is organized as follows:

**Multi-signal detection:** First, we show some results of our detection network and explain the reasons for these results. Then, we evaluate the model performance from three aspects: modulation format, carrier frequency, and start-stop time. We also compare our network with other detection networks.

**Signal modulation recognition:** We evaluate our model's recognition performance on each modulated signal. We also discuss the network performance in the presence of frequency offset. Meanwhile, we compare our method with traditional methods and other DL-based methods. Finally, we discuss the influence of the number of symbols and eyes, and compare the performance of single-input and multi-input networks.
