1. Introduction
Sign language is an important way for deaf people to understand and communicate with each other. Communication barriers often arise between deaf communities and people who do not know sign language, and many researchers have tried to build sign language recognition systems to break these barriers [1]. Currently, sign language recognition systems are roughly divided into two categories: (i) device-based systems and (ii) device-free systems [2,3].
Wearable sensors are widely used in device-based sign language recognition systems. In 1983, Grimes invented a data glove for dynamic gesture recognition [4]. Shukor et al. used data gloves to obtain data on Malaysian sign language letters, numbers, and words [5]. Kanokoda et al. recognized gestures with the time delay neural network (TDNN) algorithm, using gesture data obtained from data gloves based on pyrolytic graphite sheets (PGS) [6]. In general, the advantages of wearable device-based sign language recognition are accurate input data and a high recognition rate [2,4,5,6]. The disadvantages are also obvious: wearable devices are often expensive and inconvenient to carry.
Device-free sign language recognition systems are usually inexpensive and are not constrained by a wearable device [2,7]. Several device-free systems use computer vision techniques with cameras. Koller conducted a survey on gesture recognition, focusing on the RWTH-PHOENIX-Weather dataset recorded by a stationary standard color camera [8]. For example, Cui et al. adopted a convolutional neural network (CNN) with stacked temporal fusion layers and a bi-directional recurrent neural network (RNN) to extract the spatiotemporal information of sign language videos [9]. Pu et al. proposed an alignment network with iterative optimization for video-based sign language recognition, comprising a three-dimensional (3D) ResNet and an encoder-decoder network [10]. Depth information has also been used to improve performance. Ohn-Bar et al. introduced a vision-based system consisting of RGB and depth descriptors to classify gestures; it modifies the Histogram of Oriented Gradients (HOG) features with RGB and depth images to achieve higher classification accuracy [11]. Huang et al. proposed a 3D CNN that automatically extracts spatial-temporal features from raw video data collected with a Microsoft Kinect sensor [12]. Aly et al. utilized an unsupervised principal component analysis network (PCANet) to extract local features from sign language images captured by a Microsoft Kinect sensor; the extracted features were classified with a support vector machine (SVM) [13]. It is worth noting that computer vision-based sign language recognition systems are not only susceptible to lighting conditions and obstacles but also raise privacy concerns [2,7].
With the widespread deployment of wireless networks, wireless sensing technologies have grown rapidly, and gesture recognition based on commodity Wi-Fi has been widely studied [2,7]. A sign language recognition system that is non-intrusive and insensitive to lighting conditions has therefore attracted the interest of many researchers. The gesture recognition system proposed by Melgarejo et al. used directional antenna technology to achieve fine-grained gesture recognition [14]. Shang and Wu used Wi-Fi signals to recognize different hand gestures and arm movements in a system called WiSign [15]. Wi-Finger, which can recognize nine digit finger gestures from American Sign Language (ASL), was implemented on commercial Wi-Fi infrastructure [16].
Most Wi-Fi sensing research uses Wi-Fi channel state information (CSI), which describes how the signal propagates from the transmitter to the receiver and reveals combined effects such as scattering, fading, and power decay with distance. Zhou et al. collected digit gesture data with CSI-based techniques and reached a recognition accuracy of 96% for these gestures through deep learning [17]. Ma et al. collected CSI traces for 276 sign gestures that are frequently used in daily life; the resulting dataset is called the SignFi dataset. Their SignFi method is a nine-layer CNN model that recognizes sign language gestures in laboratory, home, and mixed lab-and-home environments [1].
However, sign language recognition based on Wi-Fi sensing faces three challenges. First, the CSI signal is mixed with background noise, signal interference, and multipath noise. Second, the diversity, complexity, and similarity of gestures make them difficult to distinguish. Third, sign language involves both the spatial configuration of gestures and the changes of actions over time. Ahmed et al. proposed a higher-order statistics-based recognition (HOS-Re) model that extracts higher-order statistical features from the SignFi dataset and selects a robust feature subset as input to a multilevel support vector machine (SVM) classifier [3].
Our goal is to recognize gestures from the SignFi dataset. We propose a novel gesture recognition method that combines singular value decomposition (SVD), a dual-output two-stream network, and an attention mechanism. SVD is used in data preprocessing to reduce noise. The proposed dual-output two-stream network, which combines a spatial-stream network and a motion-stream network, extracts spatial and temporal information and classifies input patterns. The attention mechanism in deep learning is inspired by the way humans focus attention and has been widely used in natural language processing; in our work, it selects the important features from the dual-output two-stream network. In this study, we find that the additional auxiliary output effectively alleviates the backpropagation problem of the two-stream network and improves its recognition accuracy.
In summary, the contributions of this work are as follows:
This work shows how to process sign language CSI traces with SVD. SVD not only makes sign language features more prominent but also reduces noise and outliers to a certain extent. It improves the recognition accuracy of the two-stream network while offering fast execution, robustness, and generalization ability.
We explore a novel scheme, the dual-output two-stream network, consisting of a spatial-stream network and a motion-stream network. The input of the spatial-stream network is a three-dimensional array (similar to an RGB image array) composed of the amplitude and phase of each gesture. The array differences, which represent the amplitude and phase changes, are fed into the motion-stream network. The convolutional features from the two streams are fused, and an attention mechanism then automatically selects the most descriptive features. Experimental results show that the dual output effectively alleviates the backpropagation problem of the two-stream CNN and improves accuracy.
Fine-tuning an ImageNet pre-trained CNN model on CSI datasets has not previously been explored. We evaluate CNN architectures with different numbers of layers on CSI data.
2. Materials and Methods
2.1. Received Signal Strength Indicator and Channel State Information
The principle of wireless indoor behavior detection is that a transmitted wireless signal propagates along multiple paths: reflection and scattering cause multiple superimposed copies of the signal to arrive at the receiver in an indoor environment. These signals are physically affected by human behavior in the transmission space and carry characteristic environmental information. Therefore, the information extracted from multipath superimposed signals can be used to identify human behavior [18].
The most common data sources for device-free gesture recognition systems based on Wi-Fi signals are the Received Signal Strength Indicator (RSSI) and CSI [19]. RSSI is the most widely used signal indicator for wireless devices [20]. It describes the attenuation experienced by a wireless signal during propagation. In a wireless sensor link, the RSSI changes with the movement of a person; in other words, a person's movement can be detected from changes in RSSI. Since RSSI is easy to capture, it was used for hand gesture recognition in early work [21]. However, RSSI is coarse-grained information, derived mainly from the superimposition of signals at the receiver. It is affected by the multipath effect and environmental noise, so it fluctuates strongly and has poor stability [2].
RSSI only reflects the total amplitude of multipath overlap at the media access control (MAC) layer, whereas CSI is finer-grained sub-carrier information from the physical layer. In a Wi-Fi system combining multiple-input multiple-output (MIMO) technology with orthogonal frequency division multiplexing (OFDM), CSI is derived from the sub-carriers decoded by OFDM [22]. It can effectively eliminate or reduce the interference caused by the multipath effect. CSI contains amplitude and phase information for each sub-carrier, and the sub-carriers do not interfere with each other. Thus, CSI is more sensitive and reliable than RSSI: it offers higher detection accuracy and sensitivity and can therefore support more detailed motion detection [23].
A set of CSI data can be obtained from each received data packet of a wireless network card compatible with the IEEE 802.11n protocol standard. The amplitude and phase of a sub-carrier in the CSI data are shown in Equation (1):

H(f_k) = ‖H(f_k)‖ e^{j∠H(f_k)}, k = 1, …, K, (1)

where H(f_k) is the CSI of the sub-carrier with center frequency f_k, and ‖H(f_k)‖ and ∠H(f_k) are the amplitude and phase at center frequency f_k, respectively. They are the most important information in the CSI data, and K is the total number of sub-carriers.
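As a concrete illustration of Equation (1), the amplitude and phase can be separated from complex CSI values with NumPy. The sketch below uses synthetic data shaped like a single SignFi gesture (200 packets × 30 sub-carriers × 3 antennas); the array contents are illustrative, not code from the paper.

```python
import numpy as np

# Synthetic complex CSI standing in for one gesture's trace:
# 200 packets x 30 sub-carriers x 3 antennas.
rng = np.random.default_rng(0)
csi = rng.standard_normal((200, 30, 3)) + 1j * rng.standard_normal((200, 30, 3))

amplitude = np.abs(csi)    # ||H(f_k)|| in Equation (1)
phase = np.angle(csi)      # angle of H(f_k), in radians, in (-pi, pi]

# The complex CSI is recovered exactly from the two components.
reconstructed = amplitude * np.exp(1j * phase)
assert np.allclose(reconstructed, csi)
```

The identity checked by the final assertion is exactly Equation (1) applied element-wise to the whole trace.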
2.2. Singular Value Decomposition
In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix [24]. SVD can be applied to any matrix: for any matrix A there always exists an SVD, as shown in Equation (2):

A = UΣV^T, (2)

where A is an m × n real or complex matrix, U is an m × m square matrix whose orthogonal columns are called left singular vectors, and Σ is an m × n rectangular diagonal matrix. Except for the diagonal elements, all elements of Σ are 0; the elements on the diagonal are called singular values. V^T is the transpose of V, an n × n square matrix whose orthogonal columns are called right singular vectors.
Generally speaking, the values on the diagonal of Σ are in descending order [24]. The larger a singular value, the more important the corresponding dimension. We can keep only the top singular values to approximate the matrix. This not only extracts the important features of the data but also simplifies it and suppresses noise and redundancy. The number of singular values to keep depends on various factors, such as the dataset, the recognition method, and the temporal and spatial characteristics of the data.
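The truncation described above can be sketched in a few lines of NumPy. The helper name `svd_denoise` and the synthetic 200 × 30 matrix are illustrative assumptions, not the paper's code; zeroing all but the largest k singular values yields the best rank-k approximation of the input.

```python
import numpy as np

def svd_denoise(matrix, k):
    """Keep only the k largest singular values of `matrix` (rank-k truncation)."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    s = s.copy()
    s[k:] = 0.0                      # discard the smallest singular values
    return u @ np.diag(s) @ vt       # best rank-k approximation of `matrix`

rng = np.random.default_rng(1)
noisy = rng.standard_normal((200, 30))
denoised = svd_denoise(noisy, k=20)  # matches the SVD_20 setting used later

assert denoised.shape == noisy.shape
assert np.linalg.matrix_rank(denoised) <= 20
```

NumPy returns the singular values already sorted in descending order, so slicing `s[k:]` removes exactly the least important components.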
2.3. SignFi Dataset
The SignFi dataset contains CSI data extracted by the 802.11n CSI Tool on an Intel WiFi Link 5300 device with three antennas. The dataset was collected using a transmitter with three external antennas and a receiver with one internal antenna.
Figure 1 shows the measurement scenes in the lab and home environments. The 802.11n CSI Tool provides CSI values for 30 sub-carriers, sampled approximately every 5 milliseconds. Each gesture lasts about 1 s, so there are 200 CSI samples per gesture. The CSI data are stored as a 3D matrix of complex values representing amplitude and phase information; the size of the 3D matrix is 200 × 30 × 3. The 3D amplitude and phase matrices are similar to digital images with a spatial resolution of H × W and C color channels. Thus, the CSI data can be regarded as images, with the three color channels corresponding to the three antenna signals.
The SignFi dataset consists of two parts. The first part includes 276 gestures with a total of 8280 instances from the same user: 5520 instances collected in the laboratory and 2760 at home, i.e., 20 and 10 instances per gesture, respectively. The second part includes 150 gestures with 7500 instances collected from five users in the laboratory: 50 instances per gesture, 10 from each user. The dataset was further divided into four groups to train and evaluate our method: Home276, Lab276, Lab+Home276, and Lab150. The numbers of gesture classes are 276 and 150, respectively.
Table 1 shows the statistics of the SignFi dataset.
2.4. Data Preprocessing
The amplitude and phase can be obtained from the 3D matrix of raw CSI; each has a size of 200 × 30 × 3. For each antenna, the amplitude and phase are obtained from Equations (3) and (4):

‖H(f_k)‖ = √(Re(H(f_k))² + Im(H(f_k))²), (3)

∠H(f_k) = arctan(Im(H(f_k)) / Re(H(f_k))), (4)

where Re(·) and Im(·) denote the real and imaginary parts of the complex CSI value.
Note that we directly take the angular value of the phase without unwrapping it to eliminate the phase shift, unlike the SignFi method [1]. We concatenate the amplitude and phase of each gesture and reshape them into a combination matrix of size 200 × 60 × 3, which serves as the input data of the spatial-stream network. The input of the motion-stream network is the difference (of size 200 × 60 × 3) of the above combination matrices, i.e., a concatenation of the amplitude difference and the phase difference. The difference is computed between two consecutive instances and describes the changes in amplitude and phase; it indicates how the gesture changes in the salient regions of movement. The two modalities, the combination matrix and the difference matrix, are then preprocessed with SVD to remove redundant and irrelevant noise.
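The construction of the two input modalities can be sketched as follows, assuming a 200 × 30 × 3 complex CSI matrix per gesture as described in Section 2.3. The helper name `combination_matrix` and the synthetic gestures are assumptions for illustration only.

```python
import numpy as np

def combination_matrix(csi):
    """Concatenate amplitude and phase along the sub-carrier axis.

    csi: complex array of shape (200, 30, 3) -> output (200, 60, 3),
    columns 0-29 amplitude, columns 30-59 phase.
    """
    return np.concatenate([np.abs(csi), np.angle(csi)], axis=1)

rng = np.random.default_rng(2)
shape = (200, 30, 3)
gesture_a = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
gesture_b = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

spatial_input = combination_matrix(gesture_a)                 # spatial stream
motion_input = combination_matrix(gesture_b) - spatial_input  # motion stream

assert spatial_input.shape == (200, 60, 3)
assert motion_input.shape == (200, 60, 3)
```

The difference of two consecutive combination matrices plays the role of the motion-stream input; SVD truncation would then be applied to both modalities before training.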
Figure 2 shows the combination and difference matrices before and after SVD preprocessing for the sign "GO" in the home and laboratory environments. Each picture in Figure 2 represents a 3D matrix. Figure 2a,b,e,f are combination matrices of size 200 × 60 × 3, and Figure 2c,d,g,h are difference matrices of the same size. The Y axis represents the first dimension and the X axis the second dimension; the RGB color is the third dimension, representing the three antenna signals. On the X axis, the first half (0–29) is the amplitude information and the second half (30–59) is the phase information.
From Figure 2, we make three observations. (1) The combination matrices are more colorful than the difference matrices. The color channels correspond to the three antenna signals; the richer the color, the greater the diversity of the signal and the more information it contains. Thus, the combination matrices contain more information than the difference matrices. (2) When the same user performs the same gesture, the difference between the home and laboratory environments results in different CSI data, especially in the combination matrices, as shown in Figure 2a,b,e,f. In other words, the amplitude and phase of CSI are easily affected by the environment, whereas the difference matrices are less affected. (3) We apply SVD preprocessing to the amplitude and phase of the CSI data separately. To strike a balance between preserving data features and eliminating noise, SVD keeps only the top 20 of the 30 singular values, a setting we denote SVD_20. Figure 2c,d,g,h show that the matrix signals become smoother after SVD.
2.5. Dual-Output Two-Stream Convolutional Neural Network
The convolutional neural network (CNN) is the most successful neural network in the field of deep learning [25]. It avoids complicated image preprocessing and can take raw images as input, achieving end-to-end learning. CNNs derive from Hubel and Wiesel's 1962 study of the cat visual system [26]. In 1998, Yann LeCun proposed the LeNet-5 network to solve the visual task of handwritten digit recognition [27]. In the 2012 ImageNet image recognition competition, Hinton's group used AlexNet to greatly improve the accuracy of image recognition and transform the field [28]. This brought CNNs much attention and made them a research hotspot. To improve CNN performance, several improved architectures have been proposed, such as ZFNet [29], VGGNet [30], GoogLeNet [31], ResNet [32], DenseNet [33], and ResNeXt [34]. These networks focus on three important aspects: depth, width, and cardinality. At the same time, CNN architectures have developed in terms of attention mechanisms, efficiency, and automation; the best known are SENet [35], CBAM [36], SqueezeNet, MobileNet [37], NASNet [38], and EfficientNet [39].
In video action recognition, Simonyan et al. proposed a two-stream CNN structure with an RGB input and an optical-flow input [40]. They trained two identical CNN structures and merged them with a late-fusion method. This is an effective approach in action recognition when the training dataset is limited, and much research has been built on this architecture [41,42,43]. For example, Wang et al. proposed the temporal segment network (TSN), which divides the input video into several segments and sparsely samples two-stream features from these segments [41]. Feichtenhofer et al. extended the two-stream CNN and proposed a spatio-temporal CNN [42].
Figure 3 shows the architecture of our proposed sign language recognition method, which combines SVD, a dual-output two-stream network, and an attention mechanism. The SignFi dataset was collected through CSI measurements. The raw CSI data form a 3D matrix that can be regarded as an image; thus, computer vision techniques and CNN models can be used to process the CSI data.
The amplitude and phase information contain noise and a certain phase offset. In our method, SVD is first used to remove redundant and irrelevant noise in the amplitude and phase. They are then concatenated and converted into a 3D matrix similar to an RGB image array. After SVD processing, the resulting matrix is fed into the spatial-stream CNN, the top branch of our dual-output two-stream network. Sign language involves not only the spatial configuration of gestures but also the changes of actions over time. We therefore introduce the amplitude-difference and phase-difference information, which represent the changes in amplitude and phase, respectively. The difference matrix is input to the motion-stream CNN, the bottom branch of our dual-output two-stream network.
The proposed dual-output two-stream network is shown in Figure 4. Two types of modality data, the combination matrix and the difference matrix, are input into the network. In this study, the ResNet model is used for both stream CNNs, whose convolutional layers extract multiple levels of features. When the two streams are fused by concatenation, the attention mechanism (CBAM) module [36] automatically selects the most descriptive features learned by the two streams. Batch normalization (BN) is then used to prevent overfitting. The ensemble prediction is the final output, as shown at the bottom of Figure 4. The two cross-entropy losses are combined to optimize the learning process. The dual output and the two cross-entropy losses in this structure mainly borrow ideas from the GoogLeNet architecture: the additional classification branch mainly provides gradient signal for the earlier convolutions. When the network deepens, the gradient cannot be effectively propagated from back to front and the network parameters cannot be updated; such a branch alleviates this gradient propagation problem.
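The dual-output idea can be sketched schematically in PyTorch. The tiny convolutional streams below stand in for the ResNet backbones, the CBAM module is omitted, and the layer sizes, class count, and auxiliary-loss weight are illustrative assumptions rather than the paper's actual configuration.

```python
import torch
import torch.nn as nn

class DualOutputTwoStream(nn.Module):
    """Two-stream CNN with a main head on fused features and an auxiliary head."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.spatial = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(4))
        self.motion = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                                    nn.AdaptiveAvgPool2d(4))
        self.aux_head = nn.Linear(8 * 4 * 4, num_classes)    # auxiliary output
        self.main_head = nn.Linear(16 * 4 * 4, num_classes)  # fused output

    def forward(self, x_spatial, x_motion):
        fs = self.spatial(x_spatial)
        fm = self.motion(x_motion)
        aux = self.aux_head(fs.flatten(1))     # branch classifier on one stream
        fused = torch.cat([fs, fm], dim=1)     # channel-wise concatenation fusion
        return self.main_head(fused.flatten(1)), aux

model = DualOutputTwoStream()
xs = torch.randn(2, 3, 200, 60)   # combination matrices (batch of 2)
xm = torch.randn(2, 3, 200, 60)   # difference matrices
labels = torch.tensor([0, 1])

main_logits, aux_logits = model(xs, xm)
criterion = nn.CrossEntropyLoss()
# The two cross-entropy losses are combined; the weight 0.3 is an assumption.
loss = criterion(main_logits, labels) + 0.3 * criterion(aux_logits, labels)
loss.backward()   # gradients flow to early layers through both heads
```

Because the auxiliary head attaches directly to an early stream, its loss delivers gradient to the front of the network even when the main path is deep, which is the alleviation effect described above.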
Most CNNs provide pre-trained models based on the ImageNet dataset, for which the CSI data are unknown. Transfer learning allows us to use a small amount of newly labeled data to build a high-quality classification model for the new data. We therefore use transfer learning to fine-tune the pre-trained CNN models, speeding up training and improving accuracy. In our transfer learning, we freeze the first five layers of the pre-trained model and train the remaining layers. In this way, we retain the generic features learned from the ImageNet dataset while also learning domain knowledge from the CSI data.
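The freezing step can be sketched as follows; the toy `nn.Sequential` model stands in for the ImageNet pre-trained backbone, and the layer count and indexing are illustrative assumptions.

```python
import torch.nn as nn

# Toy stand-in for a pre-trained backbone with ten layers.
model = nn.Sequential(*[nn.Linear(16, 16) for _ in range(10)])

# Freeze the first five layers: their ImageNet-learned weights are kept
# fixed and receive no gradient updates during fine-tuning.
for layer in list(model.children())[:5]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
frozen = [p for p in model.parameters() if not p.requires_grad]
```

An optimizer built only over `trainable` then fine-tunes the later layers on the CSI data while the generic early features stay intact.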
4. Discussion
Deep learning models have generally achieved great success thanks to the availability of massive datasets and increased model depth and parameterization. In practice, however, factors such as memory use and computation time during training and testing are important when choosing among the many available models. In addition, the success of deep learning also depends on the training data and on model generalization, which is crucial for deploying models in practice because it is difficult to collect training data and train individual models for every environment. In other words, generalization capability matters most for practical use. According to the evaluation results shown in Table 5, our method is more practical than the other methods.
Diversity of the input data is very helpful for CNN-based methods, which extract features through training: input diversity means the CNN can extract more types of features, which helps avoid overfitting during training. The proposed dual-output two-stream network therefore uses two modalities of input data and achieves good performance. Moreover, input data containing redundant and irrelevant noise must be preprocessed, as the above experiments demonstrate. Table 3, Table 4, Table 5 and Table 6 show that SVD preprocessing improves the performance of our dual-output two-stream network.
In this study, the experimental results also show that deep learning is not always the winner. As can be seen from Table 6, the HOS-Re method obtains the best result. This is a traditional machine learning method: it relies on manual feature engineering to compute a large number of features and uses an SVM as the classifier, unlike CNN-based methods such as the SignFi method and ours, which extract features automatically through training. This evaluation shows that, as long as good features can be found, traditional machine learning based on feature engineering still deserves attention.