## *3.2. AP-DCN*

#### 3.2.1. Network Architecture of AP-DCN

In this section, AP-DCN, which makes full use of the amplitude and phase information of CSI for multi-location human activity recognition, is designed. The architecture of the proposed AP-DCN is shown in Figure 5. With CSI as the input, in order to efficiently guide the network to learn meaningful information, the calculated amplitude and phase are fed into the backbone network as the real and imaginary parts of a new complex matrix, respectively.

The network consists of two complex convolution blocks, each of which contains a two-dimensional complex convolution layer, a complex batch normalization layer, and a complex activation function layer. Regarding the number of network layers, theoretically, the more layers there are, the more effectively features can be extracted. However, in our scenario the data samples are limited, and too many layers easily lead to overfitting of the model. Considering the overall complexity of the network as well, we designed it with two complex convolutional blocks. Specifically, the two complex convolutional layers use 32 and 16 complex convolution kernels, respectively, with a kernel size of 3 × 3. The batch normalization layers correspond to 32 and 16 channels, respectively. The Rectified Linear Unit (ReLU) is used as the activation function of the network. In order to reduce the number of model parameters and alleviate overfitting to some extent, adaptive average pooling is applied to the real part and the imaginary part, and the size of the output feature map is 1 × 1. Subsequently, a complex linear layer follows, which is equivalent to the fully connected layer of a real-valued neural network. The input size of the linear layer is 16, and the output size is 5, corresponding to the five activity categories. Finally, since human activity recognition is defined as a classification problem, a softmax layer is connected to the end of the network to predict the category of the activity. The details are presented in the following.
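As an illustration, the following is a minimal PyTorch sketch of a network with this layout (not the authors' implementation). It assumes each complex layer is realized as a pair of real-valued layers acting on the real and imaginary parts, and it simplifies the complex batch normalization of Section 3.2.2 to per-part batch normalization; all class and variable names are ours.

```python
# Minimal sketch (not the authors' code): AP-DCN-style layout in PyTorch, with
# complex layers realized as paired real-valued operations on (amplitude, phase).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComplexConv2d(nn.Module):
    """(W_R + iW_I) * (A + iP): four real convolutions combined as in Section 3.2.2."""
    def __init__(self, in_ch, out_ch, k=3, padding=1):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, k, padding=padding)
        self.conv_i = nn.Conv2d(in_ch, out_ch, k, padding=padding)

    def forward(self, a, p):                     # a: real part, p: imaginary part
        return self.conv_r(a) - self.conv_i(p), self.conv_i(a) + self.conv_r(p)

class ComplexLinear(nn.Module):
    """Same combination rule, with matrix multiplication instead of convolution."""
    def __init__(self, in_f, out_f):
        super().__init__()
        self.fc_r = nn.Linear(in_f, out_f)
        self.fc_i = nn.Linear(in_f, out_f)

    def forward(self, a, p):
        return self.fc_r(a) - self.fc_i(p), self.fc_i(a) + self.fc_r(p)

class APDCN(nn.Module):
    def __init__(self, n_classes=5):
        super().__init__()
        self.conv1, self.conv2 = ComplexConv2d(1, 32), ComplexConv2d(32, 16)
        # Simplified stand-in for complex batch normalization: one BatchNorm per part
        # (the paper's version whitens with the 2x2 covariance matrix of Section 3.2.2).
        self.bn1_r, self.bn1_i = nn.BatchNorm2d(32), nn.BatchNorm2d(32)
        self.bn2_r, self.bn2_i = nn.BatchNorm2d(16), nn.BatchNorm2d(16)
        self.pool = nn.AdaptiveAvgPool2d(1)      # 1 x 1 output feature map
        self.fc = ComplexLinear(16, n_classes)   # input size 16, output size 5

    def forward(self, amp, pha):                 # each part: (batch, 1, 750, 30)
        r, i = self.conv1(amp, pha)
        r, i = F.relu(self.bn1_r(r)), F.relu(self.bn1_i(i))   # CReLU on each part
        r, i = self.conv2(r, i)
        r, i = F.relu(self.bn2_r(r)), F.relu(self.bn2_i(i))
        r, i = self.pool(r).flatten(1), self.pool(i).flatten(1)
        r, i = self.fc(r, i)
        return torch.sqrt(r ** 2 + i ** 2 + 1e-12)   # modulus; softmax applied by the loss

amp, pha = torch.randn(2, 1, 750, 30), torch.randn(2, 1, 750, 30)
print(APDCN()(amp, pha).shape)                   # torch.Size([2, 5])
```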

**Figure 5.** The architecture of the AP-DCN.

#### 3.2.2. Network Layer of AP-DCN

For the CSI channel matrix *H* = *HR* + *iHI*, where *HR* and *HI* are real matrices and each element of *H* is a complex number, the amplitude and the phase can be expressed as:

$$A = \|H\| = \sqrt{H\_R^2 + H\_I^2} \tag{3}$$

$$P = \angle H = \arctan(H\_I/H\_R) \tag{4}$$
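For instance, the amplitude and phase in Equations (3) and (4) can be computed directly from the complex CSI matrix. The NumPy snippet below illustrates this with a random toy matrix standing in for real CSI data; note that `np.angle` uses the four-quadrant arctangent rather than the plain arctangent.

```python
# Amplitude and phase of a complex CSI matrix (toy random data, not real CSI).
import numpy as np

H = np.random.randn(750, 30) + 1j * np.random.randn(750, 30)   # H = H_R + i H_I
A = np.abs(H)           # amplitude, Eq. (3): sqrt(H_R^2 + H_I^2)
P = np.angle(H)         # phase, Eq. (4), via the four-quadrant arctangent
C = A + 1j * P          # new complex matrix fed to the network as (real, imaginary)
```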

As described in the literature [31], to perform two-dimensional convolution operations in the complex domain, a complex filter matrix (complex convolution kernel) *W* = *WR* + *iWI*, where *WR* and *WI* are real matrices, is convolved with the complex matrix *C* = *A* + *iP*. The calculation process of the complex convolution can be expressed as:

$$\begin{split} W \* C &= (W\_R + iW\_I) \* (A + iP) \\ &= (W\_R \* A - W\_I \* P) + i(W\_I \* A + W\_R \* P) \end{split} \tag{5}$$

where ∗ represents the convolution operation. The real and imaginary parts of the convolution operation can be expressed in matrix notation as follows:

$$\begin{bmatrix} \Re(W \* C) \\ \Im(W \* C) \end{bmatrix} = \begin{bmatrix} W\_R & -W\_I \\ W\_I & W\_R \end{bmatrix} \* \begin{bmatrix} A \\ P \end{bmatrix} \tag{6}$$

where ℜ(·) and ℑ(·) denote taking the real and imaginary parts of a complex number, respectively. The complex convolution operation is demonstrated in Figure 6.

**Figure 6.** Complex convolution operation.
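As a quick numerical check of Equations (5) and (6), the snippet below (a toy single-channel example with random tensors, our own illustration) compares the four-real-convolution decomposition against a direct complex sliding-window computation.

```python
# Toy check of Eq. (5): complex convolution via four real convolutions.
import torch
import torch.nn.functional as F

A, P = torch.randn(1, 1, 8, 8), torch.randn(1, 1, 8, 8)    # amplitude / phase
Wr, Wi = torch.randn(1, 1, 3, 3), torch.randn(1, 1, 3, 3)  # real / imaginary kernels

real = F.conv2d(A, Wr) - F.conv2d(P, Wi)                   # Re(W * C)
imag = F.conv2d(A, Wi) + F.conv2d(P, Wr)                   # Im(W * C)

# Direct complex sliding-window computation (conv layers use cross-correlation).
C, W = torch.complex(A, P).squeeze(), torch.complex(Wr, Wi).squeeze()
out = torch.zeros(6, 6, dtype=torch.cfloat)
for u in range(6):
    for v in range(6):
        out[u, v] = (C[u:u + 3, v:v + 3] * W).sum()

print(torch.allclose(out.real, real.squeeze(), atol=1e-5),
      torch.allclose(out.imag, imag.squeeze(), atol=1e-5))   # True True
```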

To ensure that the inputs to each layer of the network follow the same distribution during training, and to effectively avoid the vanishing-gradient problem, complex batch normalization is used to transform the input of each layer to a standard normal distribution with a mean of 0 and a variance of 1, thereby accelerating the convergence of the deep model. Taking the input *x* = {*x*1, *x*2, ..., *xm*} as an example, the output of the complex batch normalization layer can be obtained via the following process:

$$\tilde{x} = V^{-\frac{1}{2}} (x - \mathbb{E}[x]) \tag{7}$$

where the expectation E is calculated as follows:

$$\mathbb{E}[x] = \begin{bmatrix} \mathbb{E}[\Re(x)] \\ \mathbb{E}[\Im(x)] \end{bmatrix} = \begin{bmatrix} \frac{1}{m} \sum\_{i=1}^{m} \Re(x\_i) \\ \frac{1}{m} \sum\_{i=1}^{m} \Im(x\_i) \end{bmatrix} \tag{8}$$

The covariance matrix *V* is

$$\begin{aligned} V &= \begin{pmatrix} V\_{rr} & V\_{ri} \\ V\_{ir} & V\_{ii} \end{pmatrix} \\ &= \begin{pmatrix} \operatorname{Cov}(\Re(x), \Re(x)) & \operatorname{Cov}(\Re(x), \Im(x)) \\ \operatorname{Cov}(\Im(x), \Re(x)) & \operatorname{Cov}(\Im(x), \Im(x)) \end{pmatrix} \end{aligned} \tag{9}$$

where *Cov* denotes the covariance calculation. Take Cov(ℑ(*x*), ℜ(*x*)) as an example:

$$\operatorname{Cov}(\Im(x), \Re(x)) = \frac{1}{m} \sum\_{i=1}^{m} \left( \Im(x\_i) - \mathbb{E}[\Im(x)] \right) \left( \Re(x\_i) - \mathbb{E}[\Re(x)] \right) \tag{10}$$

In order to maintain the original feature distribution, the scale transformation and translation transformation follow the calculation process in reference [31].
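To make the whitening step in Equations (7)–(10) concrete, here is a small NumPy sketch of our own: it centers a batch of complex values and multiplies by the inverse square root of the 2 × 2 covariance matrix; the learnable scale and shift parameters of [31] are omitted.

```python
# Sketch of complex batch-norm whitening, Eqs. (7)-(10); scale/shift of [31] omitted.
import numpy as np

def complex_whiten(x, eps=1e-5):
    """x: complex array; returns V^(-1/2) (x - E[x]) computed over the whole batch."""
    r, i = x.real - x.real.mean(), x.imag - x.imag.mean()     # x - E[x]
    V = np.array([[r.var() + eps, (r * i).mean()],            # covariance matrix, Eq. (9)
                  [(r * i).mean(), i.var() + eps]])
    w, U = np.linalg.eigh(V)                                   # V^(-1/2) via eigen-decomposition
    inv_sqrt = U @ np.diag(w ** -0.5) @ U.T
    out = inv_sqrt @ np.stack([r.ravel(), i.ravel()])
    return (out[0] + 1j * out[1]).reshape(x.shape)

x = np.random.randn(64, 16) + 1j * np.random.randn(64, 16)    # toy complex batch
y = complex_whiten(x)
# After whitening, the real/imaginary covariance matrix is close to the identity.
print(np.cov(np.stack([y.real.ravel(), y.imag.ravel()])).round(2))
```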

The complex ReLU (CReLU) activation function is applied on both the real and the imaginary part of a neuron. For a complex input *z*, it is given by

$$\text{CReLU}(z) = \text{ReLU}(\Re(z)) + i \text{ReLU}(\Im(z)) \tag{11}$$

The complex linear layer is computed in the same way as the complex convolution operation, with the convolution operation replaced by the multiplication operation. Then, the modulus of the complex output *z* of the complex linear layer is calculated to obtain:

$$z' = \sqrt{\left(\Re(z)\right)^2 + \left(\Im(z)\right)^2} \tag{12}$$

In the training phase, the cross-entropy loss is employed. Letting *L* denote a real-valued loss function, the back-propagation (BP) can be written as:

$$\begin{split} \nabla\_L(H) &= \frac{\partial L}{\partial H} = \frac{\partial L}{\partial H\_R} + i \frac{\partial L}{\partial H\_I} = \frac{\partial L}{\partial \Re(H)} + i \frac{\partial L}{\partial \Im(H)} \\ &= \Re(\nabla\_L(H)) + i \Im(\nabla\_L(H)) \end{split} \tag{13}$$

$$\begin{split} \nabla\_L(W) = \frac{\partial L}{\partial W} &= \frac{\partial L}{\partial W\_R} + i \frac{\partial L}{\partial W\_I} \\ &= \Re(\nabla\_L(H)) \left( \frac{\partial H\_R}{\partial W\_R} + \frac{\partial H\_R}{\partial W\_I} \right) + \Im(\nabla\_L(H)) \left( \frac{\partial H\_I}{\partial W\_R} + \frac{\partial H\_I}{\partial W\_I} \right) \end{split} \tag{14}$$

The loss function is minimized with Adam [33] to optimize the network parameters. The exponential decay rates *ρ*1 and *ρ*2 are empirically set to 0.9 and 0.999, and the learning rate is set to 0.001. The ReduceLROnPlateau learning rate policy is utilized: the learning rate is halved when there is no improvement in the training loss over eight epochs. The total number of epochs is set to 50.
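A minimal sketch of this training configuration might look as follows; a placeholder linear model and random tensors stand in for AP-DCN and the CSI dataset.

```python
# Training configuration sketch: Adam (betas 0.9/0.999, lr 0.001), cross-entropy loss,
# ReduceLROnPlateau halving the learning rate after 8 epochs without improvement, 50 epochs.
import torch
import torch.nn as nn

model = nn.Linear(16, 5)                      # placeholder standing in for AP-DCN
criterion = nn.CrossEntropyLoss()             # cross-entropy (applies softmax internally)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=8)

x, y = torch.randn(120, 16), torch.randint(0, 5, (120,))     # toy features / labels
for epoch in range(50):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())               # monitor the training loss
```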

## *3.3. DCN-TL*

#### 3.3.1. Human Activity Recognition Method Based on DCN-TL

The process of the DCN-TL-based sensing method mainly consists of feature representation and recognition, and model fine-tuning based on transfer learning. A feature representation and classification method based on a one-dimensional complex convolutional network is proposed to extract features and predict the categories of human activities. The network is trained on human activity samples with sufficient data in the source domain to obtain the pre-trained model. Then, a small number of target-domain samples are used to update the pre-trained model via transfer learning. In practical application scenarios, model pre-training and model updating are completed offline. After the optimal parameters are obtained, activity prediction can be performed online without affecting the system response speed during use.

#### 3.3.2. Feature Representation Method Based on One-Dimensional Complex Convolution

Before knowledge transfer, it is necessary to learn as much activity-related knowledge as possible from the data samples in the source domain. In order to effectively mine the time-dimension information of the CSI data, a human activity feature representation method based on a one-dimensional complex convolutional network is proposed. The activity features contained in the amplitude and phase of the CSI data are extracted by one-dimensional convolution, which is well suited to sequence information extraction. Table 1 shows the structure of the feature extraction network model. The 750 × 30 complex CSI matrix is used as the input, and its amplitude and phase are calculated and fed to the backbone network. The network consists of two one-dimensional complex convolutional layers, two complex batch normalization layers, an adaptive average pooling layer, and two complex linear layers. In Table 1, (×4) indicates that the convolution or linear multiplication between the real/imaginary parts and the two corresponding sets of network weights is performed four times, and (×2) indicates two corresponding operations on the real and imaginary parts. Complex convolution operations and complex linear operations are computed in the same way as in AP-DCN. The softmax classifier is still used for classification and recognition. The specific network parameters are set as follows: both convolutional layers use 128 convolution kernels with a kernel size of 3; since the convolution is one-dimensional, each kernel spans all input channels, so the kernel at the first layer has size 3 × 30 and the kernel at the second layer has size 3 × 128. The step size and the padding are both set to 1.


**Table 1.** The model structure of feature extraction network.
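Under the same pairing convention as before (complex layers as paired real layers, which is our simplification, with per-part batch normalization standing in for complex batch normalization), a compact sketch of this one-dimensional backbone could look as follows; the hidden size of the first linear layer is an assumption, since Table 1 is not reproduced here.

```python
# Sketch of the 1D complex backbone: two complex Conv1d layers (128 kernels, size 3,
# stride 1, padding 1), adaptive average pooling, and two complex linear layers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComplexConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, padding=1):
        super().__init__()
        self.cr = nn.Conv1d(in_ch, out_ch, k, padding=padding)   # real-part weights
        self.ci = nn.Conv1d(in_ch, out_ch, k, padding=padding)   # imaginary-part weights

    def forward(self, a, p):
        return self.cr(a) - self.ci(p), self.ci(a) + self.cr(p)

class DCNBackbone(nn.Module):
    def __init__(self, subcarriers=30, hidden=64, n_classes=5):   # hidden size is assumed
        super().__init__()
        self.conv1 = ComplexConv1d(subcarriers, 128)   # first kernel: 3 x 30
        self.conv2 = ComplexConv1d(128, 128)           # second kernel: 3 x 128
        self.bn_r = nn.ModuleList([nn.BatchNorm1d(128), nn.BatchNorm1d(128)])
        self.bn_i = nn.ModuleList([nn.BatchNorm1d(128), nn.BatchNorm1d(128)])
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc1 = nn.ModuleList([nn.Linear(128, hidden), nn.Linear(128, hidden)])
        self.fc2 = nn.ModuleList([nn.Linear(hidden, n_classes), nn.Linear(hidden, n_classes)])

    def forward(self, amp, pha):                       # each part: (batch, 30, 750)
        r, i = self.conv1(amp, pha)
        r, i = F.relu(self.bn_r[0](r)), F.relu(self.bn_i[0](i))
        r, i = self.conv2(r, i)
        r, i = F.relu(self.bn_r[1](r)), F.relu(self.bn_i[1](i))
        r, i = self.pool(r).flatten(1), self.pool(i).flatten(1)
        r, i = self.fc1[0](r) - self.fc1[1](i), self.fc1[1](r) + self.fc1[0](i)
        r, i = self.fc2[0](r) - self.fc2[1](i), self.fc2[1](r) + self.fc2[0](i)
        return torch.sqrt(r ** 2 + i ** 2 + 1e-12)     # class scores before softmax

amp, pha = torch.randn(2, 30, 750), torch.randn(2, 30, 750)
print(DCNBackbone()(amp, pha).shape)                   # torch.Size([2, 5])
```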

Figure 7 shows the specific one-dimensional convolution operation process for the real or imaginary part of the complex input matrix. Taking the amplitude or phase of CSI data with an input size of *T* × *s* as an example, it is composed of *T* time slices, and each time slice corresponds to *s* subcarrier values, which can be regarded as a feature vector of dimension *s*.

The convolution kernel is used to conduct a one-dimensional convolution operation on the input data. Different from the two-dimensional case, the kernel of the one-dimensional convolution moves only along the time axis. The convolution kernel is a feature detector, which is equivalent to a sliding time window in the time dimension. We define the number of convolution kernels as *N* and the size of the convolution kernels as *k*. The number of convolution kernels determines the dimension of the output vector, which is the number of features obtained. The size of the convolution kernel determines the time length of the activity involved in each convolution operation. The length of the input data and the size of the convolution kernel determine the number of output neurons. Taking a step size of 1 and a padding of 0 as an example, after one convolution layer, an output matrix of size *N* × (*T* − *k* + 1) is obtained. For the network mentioned above, the same loss function calculation method, model optimization method, and parameter settings are still used to train the model and obtain the pre-trained model for the next stage of knowledge transfer.
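As a quick check of this output-size relation, a single `Conv1d` with stride 1 and no padding over *T* = 750 time steps and a kernel of size *k* = 3 yields 748 = *T* − *k* + 1 output steps with *N* = 128 channels (toy random input):

```python
# Output-length check for one 1D convolution layer: N x (T - k + 1).
import torch
import torch.nn as nn

T, s, N, k = 750, 30, 128, 3
conv = nn.Conv1d(in_channels=s, out_channels=N, kernel_size=k)   # stride 1, padding 0
print(conv(torch.randn(1, s, T)).shape)       # torch.Size([1, 128, 748])
```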

**Figure 7.** One-dimensional convolution operation.

#### 3.3.3. Recognition Method Based on Transfer Learning

The above feature representation and learning methods can be used to train a basic model with strong discriminative ability from relatively sufficient source-domain data. At this point, the learned knowledge contains the basic characteristics of CSI data and the general characteristics of activities at the different source-domain locations. When data samples from different locations are unbalanced, in order to adapt the model to target-domain locations where the data samples are further constrained, the model needs to be able to transfer knowledge learned from the source-domain locations to the target-domain locations. Therefore, a multi-location activity recognition scheme based on transfer learning is proposed. The model fine-tuning of transfer learning can realize knowledge sharing with very few target-domain samples. The low-level parameters of the network are obtained from sufficient source-domain data, and the high-level parameters are learned from the target-domain data with limited samples.

The transfer learning scheme is based on the pre-trained model. Figure 8 shows the architecture of the transfer learning network. The specific process is as follows: Firstly, the network model is pre-trained using the source-domain training data set composed of several positions to obtain the optimal model parameters. These parameters are then used to initialize the network, and the network layers before the linear layers are frozen. Finally, the two linear layers are trained with very few data samples from the target-domain locations. Based on the pre-trained model, the activity feature representation learned in the source domain is transferred to the target domain, which greatly reduces the need for training samples in the target domain and effectively alleviates the problem of sample imbalance. The forward and back-propagation of traditional network training involve all layers of the whole network, while the transfer learning process only involves the last two layers, which effectively reduces the number of training parameters and shortens the training time.
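The fine-tuning step could be sketched as follows; the layer layout and hidden size are assumptions standing in for the pre-trained 1D complex backbone, and the checkpoint path is hypothetical.

```python
# Transfer-learning sketch: load a pre-trained model, freeze the lower layers,
# and fine-tune only the last two linear layers on target-domain samples.
import torch
import torch.nn as nn

model = nn.Sequential(                       # simplified stand-in for the pre-trained backbone
    nn.Conv1d(30, 128, 3, padding=1), nn.ReLU(),
    nn.Conv1d(128, 128, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(128, 64), nn.ReLU(),           # the two linear layers to be fine-tuned
    nn.Linear(64, 5),
)
# model.load_state_dict(torch.load("pretrained.pt"))   # hypothetical checkpoint path

for p in model.parameters():                 # freeze every layer ...
    p.requires_grad = False
for layer in (model[6], model[8]):           # ... then unfreeze the two linear layers
    for p in layer.parameters():
        p.requires_grad = True

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
```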

**Figure 8.** Architecture of transfer learning network.

#### **4. Experiment and Evaluation**

In order to validate the performance of our proposed AP-DCN-based and DCN-TL-based multi-location HAR methods, a series of experiments has been conducted. The experiment setup and the experiment results are reported in this section.

#### *4.1. Experiment Setup*

To fully evaluate the performance of the proposed method, a dataset has been collected in a cluttered office. The experimental scene is shown in Figure 9. The Linux 802.11n CSI Tool [34], developed by Halperin et al. on the basis of the Intel 5300 Network Interface Card (NIC), is leveraged to acquire fine-grained CSI data. The transmitter (TX) and the receiver (RX) work in 802.11n mode and operate on the 5 GHz frequency band with a bandwidth of 20 MHz. Both are equipped with three antennas, and CSI with 30 groups of subcarriers can be obtained from each TX-RX antenna pair. It is worth noting that the CSI data from only one of the antenna pairs, namely 30 subcarriers, can alternatively be used.

**Figure 9.** Data collection experimental scene.

To explore the multi-location HAR method, data samples at 24 different locations within the region between the transceivers are collected. The sampling locations are depicted in Figure 10. The distance between adjacent sampling locations is approximately 0.6 m. The room size is approximately 6 m × 8 m, and the distance between TX and RX is 4 m. We predefined five activities: drawing a circle (O), drawing a cross (X), lifting up and laying down two arms (UP), pushing and opening with two arms (PO), and sitting down (ST). Five volunteers (one female and four males, aged 23–30) performed 50 samples for each activity at each location, giving 24 × 50 = 1200 samples for each activity of each person. Since the initial sampling rate is 200 frames per second and the actual duration of the actions is 3.5∼4 s (700∼800 frames), a 750-frame segment is cut as a sample.

**Figure 10.** The layout of data collection locations.

#### *4.2. Experiment Results of AP-DCN*

The evaluation contains the following three parts. Firstly, the feasibility and effectiveness of the approach are explored. Then, the reliability is discussed. Finally, the proposed method is compared with other approaches to prove the superiority of our system.

**Overall performance.** To verify the feasibility of the multi-location sensing method, the 50 samples for each activity at the 24 locations of one person are randomly divided into three parts, the training set, the validation set, and the testing set, which account for 20%, 20%, and 60%, respectively. It is worth noting that, to reduce the computational burden, only the 30 subcarriers from one TX-RX antenna pair and five training samples from each location are used. The size of each sample is 750 × 30, and each element is a complex number whose real and imaginary parts are the amplitude and phase, respectively. The average accuracy of the proposed method for the five activities of one person is 96.53%. The confusion matrix is demonstrated in Figure 11. It can be seen that all the activities obtain an acceptable recognition accuracy. In particular, the activity ST achieves 99.86% recognition accuracy. Since X and O are both movements in front of the body after raising the right arm, they are more easily confused with each other than the other activities. In summary, our proposed method performs well in multi-location human activity sensing.

The enhancement effect of the amplitude and phase information on multi-location recognition is analyzed. The comparison of the recognition accuracy of different methods is demonstrated in Table 2. CNN represents the real-valued convolutional neural network corresponding to the AP-DCN network structure. DCN represents the complex convolutional network with the same structure that is not enhanced by the amplitude and phase calculation. The comparison between CNN and DCN shows that complex convolution plays a certain role in extracting richer activity information. The comparison between DCN and AP-DCN shows that explicitly calculating the amplitude and phase can effectively guide the network to learn more accurate information, so as to achieve higher accuracy of human activity perception.

**Figure 11.** The confusion matrix of AP-DCN recognition accuracy.



**Performance of multi-location HAR in terms of different location areas.** According to the principle of wireless sensing, when the target is farther away from the transmitter and the receiver, the delay of the reflected signal generated by the target may be larger, and after multi-path superposition its influence on the received signal is relatively small. Therefore, when the target and the sensing area are far from the transceivers, the sensing effect will decrease. To evaluate the reliability of the proposed method in different location areas, four perception regions, from near to far relative to the transceivers, are selected. Table 3 shows the recognition accuracy of the different perception areas. Loc1–Loc6 indicates that the training and testing samples are selected from locations 1–6 in Figure 10. As can be seen, as the location region expands, the perceptual effect slightly declines, but high recognition accuracy is still obtained in each perception area. For the 24 sampling positions covering almost the whole space, the recognition accuracy remains satisfactory.

**Table 3.** The AP-DCN recognition accuracy of different sensing areas.


**Performance of multi-location HAR for different numbers of training samples.** Intuitively, the more samples involved in training, the richer the activity features that can be provided. Training-set sizes of 4, 6, 8, and 10 samples for each activity at each location are investigated. The recognition accuracy with different numbers of training samples is shown in Table 4. As can be seen, the proposed method provides a satisfactory recognition accuracy of 95.81% with four training samples. When the number of training samples increases from four to ten, the recognition accuracy is further improved.

**Table 4.** The recognition accuracy for different numbers of training samples.


**Performance of multi-location HAR for different numbers of subcarriers and sampling rates.** In addition to implementing multi-location human activity recognition, this paper also aims to reduce the computational burden, making the method more suitable for real-time applications. Therefore, a small sample size is desired. CSI measurements are collected at the initial transmission rate of 200 packets per second, and the 750-frame CSI series are down-sampled to 375, 250, 150, and 75 frames. Furthermore, 10, 20, and 30 subcarriers are investigated. It is worth noting that only five training samples for each activity at each location are utilized. The recognition accuracy with different numbers of subcarriers and sampling rates is shown in Figure 12. As can be seen, the proposed method provides satisfactory recognition accuracy with very few subcarriers and low sampling rates. Even when the sampling rate decreases to 20 frames/s, the method still obtains 88.61% accuracy with only 10 subcarriers.
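For illustration, the subsampling used in this evaluation could be performed as in the sketch below; the indexing scheme and the random toy data are our assumptions, not the authors' preprocessing code.

```python
# Downsampling a 750 x 30 CSI sample in the time and subcarrier dimensions (toy data).
import numpy as np

csi = np.random.randn(750, 30) + 1j * np.random.randn(750, 30)   # one CSI sample

def subsample(sample, n_frames, n_subcarriers):
    t_idx = np.linspace(0, sample.shape[0] - 1, n_frames).astype(int)
    s_idx = np.linspace(0, sample.shape[1] - 1, n_subcarriers).astype(int)
    return sample[t_idx][:, s_idx]

print(subsample(csi, 75, 10).shape)   # (75, 10): roughly 20 frames/s with 10 subcarriers
```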

**Figure 12.** The recognition accuracy with different numbers of subcarriers and sampling rates.

**Performance of multi-location HAR for different persons.** To verify the reliability of the system for different users, we collected data samples involving five subjects marked as User1–User5. Their heights range from 160 cm to 180 cm, and their ages range from 23 to 30 years old. The recognition results of the five users for the five activities at the 24 locations are shown in Table 5. As illustrated, the average recognition accuracy is 96.85%. Consequently, our method can work well for different users.

**Table 5.** The AP-DCN recognition accuracy for different users.


**Comparison with different recognition methods.** In this part, to evaluate the superiority of our system, four typical approaches are compared with it. ActNet [26] is a state-of-the-art multi-location HAR method, which decomposes the input samples into location-irrelevant activity features and activity-irrelevant location features, and jointly learns different activities from multiple locations to mitigate the issue of insufficient data. SqueezeNet [35] and AlexNet [36] are two classical deep learning methods. WiHand [37] utilizes the low-rank and sparse decomposition (LRSD) algorithm to separate the activity signal from background information, thus making it adapt to location variation. It is worth noting that, in order to keep the settings as similar as possible to the original literature, all five methods use 10 training samples. In addition, the first three use 270 subcarriers, while the last two use 30 subcarriers. As can be seen in Table 6, our system outperforms the other methods in multi-location HAR, even when using fewer subcarriers.


**Table 6.** Comparison with different recognition methods.

#### *4.3. Experiment Results of DCN-TL*

This section still uses the data set composed of the five human activities collected at the 24 positions in Section 4.1 to evaluate the performance of the multi-location human activity recognition method based on DCN-TL. The 50 samples of each activity collected by the volunteers at each location are divided into three parts, for model training, knowledge transfer, and performance testing, accounting for 60%, 20%, and 20%, respectively. For any volunteer, a maximum of 30 training samples, 10 transfer samples, and 10 test samples are therefore available for an activity at each location. A 750-frame length and 30 subcarriers are still used as the input. The parameters of the pre-training network are the same as in the previous section.

**Overall performance.** In order to verify the perceptual performance of the method when the number of training samples at different locations is unbalanced and the number of samples at some locations is further limited, for the 24 sampling locations in the data set, we take as an example sufficient samples at six locations and insufficient samples at the other locations. The six training positions are selected at equal intervals starting from the first position in Figure 10, taking one of every four positions from location 1 to 24. Three samples are randomly selected from the 10 transfer samples for model transfer learning. The testing set consists of testing samples involving the five activities at the 24 positions, with a total of 24 × 5 × 10 = 1200 samples. Experimental results show that the average recognition accuracy of DCN-TL is 93.00%. The confusion matrix is shown in Figure 13. Among the activities, ST obtains 100% recognition accuracy, and the other activities also obtain satisfactory recognition accuracy. Therefore, DCN-TL performs well in multi-location human activity sensing when the number of training samples at different positions is unbalanced and the number of samples at some positions is further limited.

**Performance of multi-location HAR in terms of different location areas.** We discuss the performance of the DCN-TL recognition method when the perception area gradually expands. At the 24 sampling positions shown in Figure 10, locations 1–6 in the first row parallel to the transceivers are defined as perception area 1, and the corresponding experiment is marked as N1. Training positions 2 and 5 are selected symmetrically. One row is added at a time to gradually expand the perception area, forming the evaluation experiments numbered N2, N3, and N4. The training positions of each later perception area are added to those of the previous one. For example, "N1+8/11" means that training positions 8 and 11 are added to those of N1, so the four positions participating in model pre-training are 2/5/8/11. Table 7 shows the recognition accuracy of the different perception areas. It can be seen that, with the expansion of the perception area, the recognition accuracy gradually improves, because the model can learn more knowledge in the pre-training stage as the number of training positions increases.

**Figure 13.** The confusion matrix of DCN-TL recognition accuracy.



**Performance of multi-location HAR in terms of different numbers of training locations.** The number of positions involved in pre-training is a critical factor affecting perceptual performance, and its influence on recognition accuracy is discussed in this part. A total of 12, 8, 6, and 4 pre-training positions are sampled at equal intervals from the 24 positions. As shown in Table 8, when only four positions participate in training and three transfer samples are provided for each location, the recognition accuracy is still 90.42%. With 12 training positions, the recognition accuracy is 94.64%. As the number of training positions increases, the recognition accuracy increases gradually.

**Table 8.** The recognition accuracy of different numbers of pre-trained locations.


**Performance of multi-location HAR in terms of different numbers of transfer samples.** The influence of the number of samples involved in knowledge transfer on recognition accuracy is discussed. This part takes six pre-training positions as an example and tests the method at the 24 positions. Between one and five transfer samples are randomly selected from the transfer sample set. The recognition accuracy is shown in Table 9. When only one transfer sample is provided, the recognition accuracy is 90.55%; with five transfer samples, it reaches 97.44%. As the number of transfer samples increases, the model can learn more activity characteristics of the target-domain locations on top of the knowledge learned at the source-domain locations, thereby improving the recognition accuracy.


**Performance of multi-location HAR for different users.** To verify the reliability of the proposed method for different users, the human activity data of five volunteers, marked as User 1–User 5, are evaluated. Six positions are used for training, 24 positions are tested, and three transfer samples are provided for each position. Table 10 shows the recognition accuracy of the 5 activities at the 24 positions for the different volunteers. The average recognition accuracy is 94.02%. The experimental results show that this method can be applied well to different users.

**Table 10.** The DCN-TL recognition accuracy for different users.
