1. Introduction
Wearable devices have been widely adopted in recent years. Their portability, real-time operation, and intelligence make them indispensable in people’s daily lives. With the popularity of these devices, however, the user information collected by their various built-in sensors has raised concerns about data security. The information collected by wearable devices covers many aspects of a user’s life, including sensitive data such as daily behaviors, location trajectories, environmental information, and physiological health. Once this information is leaked or illegally obtained, it may be used for various illegal activities, seriously threatening the user’s privacy and personal safety.
In order to better protect users’ data security, researchers have turned their attention to biometric authentication methods, among them identity authentication based on photoplethysmography (PPG) signals. The PPG signal is obtained through non-invasive optical measurement of physiology: as the heart contracts and relaxes periodically, blood flow in the body increases and decreases. When light emitted by a light source penetrates the skin tissue, hemoglobin in the blood absorbs part of it, so the intensity of the reflected or transmitted light weakens and strengthens accordingly. A light receiver converts this change in light intensity into an electrical signal, the PPG signal [1]. Existing research shows that by analyzing the original PPG signal and key landmarks extracted from it and its derivatives, such as the systolic peak and diastolic peak, a series of benchmark features can be obtained that serve as unique biometric information for a user [2,3,4]. Moreover, the PPG signal is collected by the sensor in a wearable device without additional user operations, which makes continuous user authentication more natural. Because the PPG signal continuously monitors blood circulation, it enables continuous identity verification and improves the accuracy and reliability of authentication. PPG signals satisfy the essential characteristics of a biometric: universality, persistence, uniqueness, and ease of collection. Therefore, using PPG signals as a biometric for identity authentication is feasible and brings new possibilities to biometric authentication.
Consequently, this study proposes a deep learning-based identity authentication method for PPG signals. The primary contributions are as follows: firstly, the filtered one-dimensional temporal PPG data is partitioned into individual cycles and transformed into two-dimensional images using the Markov Transition Field (MTF) technique [5]. Unlike most existing identity authentication methods, which directly employ one-dimensional PPG signals as inputs, this paper converts the one-dimensional temporal data into two-dimensional image data before feeding it into a convolutional neural network (CNN). Two-dimensional image data is richer than one-dimensional time series data because of its multi-dimensional spatial information; in color images, for example, the three independent red, green, and blue channels each carry different image information. Therefore, two-dimensional image data shows clear advantages in feature extraction with convolutional neural networks. Secondly, a lightweight CNN model is devised, integrating the concepts of depthwise separable convolution [6] and residual structures [7]. This model maintains high recognition accuracy while reducing memory consumption and improving runtime efficiency, making it well-suited for resource-constrained devices such as wearables, which have small form factors, low power budgets, and limited computational resources. Identity authentication is approached as a classification problem. Experimental results show that the lightweight CNN model proposed in this study achieves an accuracy of 98.62% on the training set and 96.17% on the testing set, both surpassing several traditional deep learning methods, demonstrating the high feasibility of this research.
The use of PPG signals for biometric identification has garnered significant attention in the current literature. Researchers have actively explored methods that distinguish between individuals by extracting unique and stable features from these signals. Existing methods fall into two primary categories.
In the first category, waveform characteristics of the signal are analyzed and features are extracted from both the time domain and the frequency domain. Kavsaoğlu et al. [8] extracted 40 time-domain features from the PPG signal and its first and second derivatives, proposed a feature-ranking algorithm, and verified its effectiveness with a k-NN classifier. Jaafar et al. [9] utilized the second derivative of PPG signals, known as APG signals, obtained from ten individual users; the morphology of the APG signals was analyzed to extract features, which were then classified using a Bayesian network. Meanwhile, Salanke et al. [10] segmented PPG signals based on P-wave intervals and used principal component analysis to extract features successfully utilized for signal classification.
In the second category, deep neural network models are employed for automatic feature learning and classification of PPG signals. Deep learning methods can learn useful features directly from raw data without requiring manual feature design and selection. Li et al. [11] designed and trained a multi-scale feature fusion deep learning (MFFD) model, based mainly on a convolutional neural network architecture, to extract PPG features and learn to accurately distinguish individuals by their unique PPG patterns. Wei et al. [12] first proposed a PPG augmentation technique to generate multi-scale PPG signals and then a deep end-to-end model to extract and classify the multi-scale features. Seok et al. [13] proposed a one-dimensional Siamese (twin) neural network biometric model based on PPG, which reduced noise while retaining individual characteristics through a multi-period averaging method, achieving efficient and secure identification and authentication. Abbani et al. [14] used a bidirectional long short-term memory deep learning algorithm and successfully designed an identity authentication model based on PPG signals. Dwaipayan et al. [15] designed a novel deep learning model, CorNET, which combines two convolutional layers and two long short-term memory layers for identity authentication tasks. Jordi et al. [16] proposed an end-to-end recognition architecture, built mainly from a convolutional neural network, that operates on the raw PPG signal. In addition, some methods convert PPG signals into two-dimensional image data and apply deep learning for classification. Cherry et al. [17] transformed one-dimensional PPG signals into two-dimensional spectrograms and employed convolutional neural networks for automatic feature extraction and classification. Using the scalogram technique, Mostafa et al. [18] converted one-dimensional PPG signals into two-dimensional images and developed a CVT-ConvMixer classifier with attention mechanisms to achieve individual identity recognition.
The main problem with the first category of methods above is that features extracted by analyzing the waveform characteristics of the signal, or its time-domain and frequency-domain properties, are often not comprehensive enough, and manual feature engineering is error-prone. In the second category, the deep learning models used typically have complex network structures and many parameters, resulting in high computational cost and heavy resource consumption. The method in this article converts PPG signals into two-dimensional images and uses a neural network model to automatically extract features and learn to classify them. Compared with the first category, the proposed method requires no manual feature processing; instead, it realizes end-to-end learning from raw input to final output, which significantly simplifies the feature extraction and classification pipeline and improves the efficiency and practicality of the model. Deep learning methods can learn multi-level, multi-scale feature representations through multi-layer neural network structures, enabling effective feature extraction and recognition on new, unseen data and providing strong generalization. Compared with the second category, the deep learning model proposed in this article significantly reduces model size by optimizing the network structure and reducing the number of parameters, making it more efficient for storage and transmission, especially in resource-constrained environments such as wearable devices, without sacrificing accuracy.
2. Methods
This section comprehensively elucidates the identity authentication method that uses PPG signals as a carrier. The method comprises three essential phases. The first involves signal preprocessing, to eliminate noise interference and enhance signal quality. Subsequently, the preprocessed one-dimensional signal is converted into a two-dimensional image to facilitate subsequent feature extraction and recognition. Lastly, the transformed two-dimensional signal is classified by a purpose-built lightweight convolutional neural network, achieving accurate identity authentication. The workflow is depicted in Figure 1.
2.1. PPG Signal Preprocessing
Various types of noise often accompany PPG signals, acting as interference during the collection process [19]. These noises mainly include baseline drift, which arises from respiratory fluctuations and the instability of amplification circuits; power line interference, originating from AC power sources; electromyographic noise, resulting from limb tremors and muscle contractions; and motion artifacts, caused by changes in the optical measurement due to bodily movements.
To enhance the quality of PPG signals, we used a 3rd-order Butterworth bandpass filter, which balances frequency selectivity and phase response during filtering, reducing distortion and signal delay. The filter design takes the characteristics of the PPG signal into account: the upper cutoff frequency is set to 8 Hz, which effectively filters out high-frequency interference caused by electromyographic noise and power line interference, while the lower cutoff frequency is set to 0.5 Hz, which effectively filters out low-frequency interference caused by baseline drift. The filtered PPG signals exhibited a significant quality improvement, forming the basis for subsequent identity authentication tasks. Furthermore, the filtered PPG signals underwent amplitude normalization to address amplitude variations, unifying the dynamic range of the signals to 1 and further enhancing the accuracy and stability of the identity authentication process.
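As a concrete illustration, the preprocessing step can be sketched with SciPy as follows. The sampling rate fs and the use of zero-phase filtering with filtfilt are assumptions made for the example; the paper specifies only the filter order and the 0.5–8 Hz passband.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_ppg(signal, fs=125.0, low=0.5, high=8.0, order=3):
    """Band-pass filter (3rd-order Butterworth, 0.5-8 Hz) and
    amplitude-normalize one raw PPG segment."""
    nyq = 0.5 * fs
    b, a = butter(order, [low / nyq, high / nyq], btype="band")
    filtered = filtfilt(b, a, signal)  # zero-phase filtering (assumption)
    # Normalize the amplitude so the dynamic range is exactly 1
    return (filtered - filtered.min()) / (filtered.max() - filtered.min())

# Synthetic example: a 1.2 Hz pulse plus baseline drift and mains noise
fs = 125.0
t = np.arange(0, 10, 1 / fs)
raw = (np.sin(2 * np.pi * 1.2 * t)          # pulse component
       + 0.5 * np.sin(2 * np.pi * 0.1 * t)  # baseline drift
       + 0.1 * np.sin(2 * np.pi * 50 * t))  # power line interference
clean = preprocess_ppg(raw, fs)
```

After normalization, the drift and mains components are strongly attenuated and the segment spans exactly the unit range.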
2.2. Two-Dimensional Signal Transformation Methods
Although various deep learning algorithms, such as 1D-CNNs and LSTMs, have been developed to handle one-dimensional time series data like PPG signals, two-dimensional images carry rich spatial information and structural characteristics. Deep learning methods applied to images can effectively capture edges, corners, and other informative details, ultimately improving learning efficiency. Hence, converting one-dimensional PPG signals into two-dimensional images is worthwhile.
This paper adopts the Markov Transition Field (MTF) method to transform the one-dimensional PPG signals into two-dimensional images. MTF is an image encoding technique that uses a Markov transition matrix to encode time series data. It treats the temporal evolution of the series as a Markov process, in which the future state depends solely on the present state, independent of past states. The Markov transition matrix built on this idea is then extended into a Markov Transition Field, which encodes the time series as an image. The main steps are as follows:
1. Divide the time series data equally into Q different quantile bins and label these bins from 1 to Q in sequence.
2. Replace each data point in the time series with the number of the bin it falls into.
3. Treat the resulting bin sequence as a 1st-order Markov chain, calculate the transition frequencies between quantile bins along the time axis, and construct a Q × Q transition matrix W accordingly, as shown in Formula (1). The element ω_ij represents the transition frequency from quantile bin i to quantile bin j, so W provides quantitative information about the bin-transition patterns of the time series:

W = [ω_ij]_{Q×Q},  ω_ij = P(x_t ∈ q_j | x_{t−1} ∈ q_i),  Σ_j ω_ij = 1.  (1)

4. Let x_1, …, x_N be the elements of the time series, and let q_i and q_j be the quantile bins containing the values at time steps i and j, respectively; M_ij is then the transition probability from q_i to q_j. By considering every pairwise arrangement of time positions, the Markov transition matrix W is extended to the Markov Transition Field M, as shown in Formula (2):

M = [M_ij]_{N×N},  M_ij = ω_{q_i, q_j}.  (2)
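The steps above can be sketched in NumPy as follows. The bin count Q = 8 and the quantile binning strategy are illustrative assumptions, since the paper does not state the Q it uses.

```python
import numpy as np

def markov_transition_field(x, Q=8):
    """Encode a 1-D series as a Markov Transition Field image.
    Q (number of quantile bins) is an assumption for illustration."""
    N = len(x)
    # Step 1-2: assign each sample to one of Q quantile bins (labels 0..Q-1)
    edges = np.quantile(x, np.linspace(0, 1, Q + 1)[1:-1])
    bins = np.digitize(x, edges)
    # Step 3: count transitions between consecutive bins -> Q x Q matrix W
    W = np.zeros((Q, Q))
    for a, b in zip(bins[:-1], bins[1:]):
        W[a, b] += 1
    row_sums = W.sum(axis=1, keepdims=True)
    W = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)
    # Step 4: spread W over all time-position pairs: M[i, j] = W[bin_i, bin_j]
    return W[np.ix_(bins, bins)]

x = np.sin(np.linspace(0, 2 * np.pi, 100))
M = markov_transition_field(x, Q=8)  # 100 x 100 image with entries in [0, 1]
```

Because each entry of M is a row-normalized transition frequency, the resulting image values always lie between 0 and 1 and can be mapped directly to pixel intensities.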
This paper uses a peak detection method to split the long PPG recording into independent single-cycle signals, which then serve as input to the neural network model. During segmentation, all abnormal single-cycle signals are eliminated to ensure the accuracy and reliability of subsequent analysis. These single-cycle signals are then converted into two-dimensional images of size 28 × 28 using the Markov Transition Field method, as shown in Figure 2.
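A minimal sketch of the peak-based segmentation, using SciPy’s find_peaks. The sampling rate, heart-rate bounds, and prominence threshold are assumptions used here to reject abnormal cycles; the paper does not specify its peak-detection parameters.

```python
import numpy as np
from scipy.signal import find_peaks

def split_cycles(ppg, fs=125.0, min_hr=40, max_hr=180):
    """Split a filtered PPG trace into single cardiac cycles by detecting
    systolic peaks. fs and the heart-rate bounds are assumptions."""
    min_dist = int(fs * 60 / max_hr)  # shortest plausible beat interval
    peaks, _ = find_peaks(ppg, distance=min_dist, prominence=0.1)
    cycles = [ppg[a:b] for a, b in zip(peaks[:-1], peaks[1:])]
    # Discard abnormal cycles, e.g. implausibly short or long ones
    max_len = fs * 60 / min_hr
    return [c for c in cycles if min_dist <= len(c) <= max_len]

fs = 125.0
t = np.arange(0, 10, 1 / fs)
ppg = np.sin(2 * np.pi * 1.2 * t)  # surrogate pulse at ~72 bpm
cycles = split_cycles(ppg, fs)
```

Each retained cycle can then be resampled to 28 samples so that its Markov Transition Field image is 28 × 28.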
2.3. Lightweight Convolutional Neural Network (LW-CNN)
Since this paper studies identity authentication on wearable devices, whose resources are limited, we propose a lightweight convolutional neural network, LW-CNN, that incorporates depthwise separable convolutions and residual connections. Compared with traditional convolutional neural networks (CNNs), LW-CNN significantly reduces model complexity by reducing the number of network layers, using smaller convolution kernels, and adopting a more straightforward network structure. Because of its simpler structure and fewer parameters, both forward and backward passes require less computation, improving efficiency. Moreover, the introduction of the residual structure makes the network more stable, which improves performance and enhances robustness.
Depthwise separable convolution can be regarded as a special convolution operation. In a traditional convolution, each kernel operates simultaneously on all channels of the input feature map to generate one channel of the output feature map; if the input feature map has N channels, each kernel requires N convolution sub-kernels. Depthwise separable convolution divides this process into two stages. The first stage is the depthwise convolution layer, in which each input channel has its own 3 × 3 kernel for an independent convolution: with N input channels there are N kernels, and each kernel convolves only one input channel, so the depthwise layer extracts per-channel features without increasing the number of parameters. The second stage is the pointwise convolution layer, which applies a 1 × 1 kernel to the output of the depthwise convolution. This 1 × 1 convolution is a cross-channel linear transformation; its function is to fuse and combine the features of all channels to generate the final output feature map. In this way, depthwise separable convolution significantly reduces the model’s parameters while maintaining the spatial filtering capability of convolution. Specifically, for a traditional convolutional layer with N input channels and M output channels, the number of parameters is N × M × K × K (where K is the kernel size). The corresponding depthwise separable layer needs only N × K × K (depthwise layer) + M × N (pointwise layer) parameters. Since M and N are usually large, this reduction is very significant.
Figure 3 is a schematic diagram of depthwise separable convolution.
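The parameter formulas in the paragraph above can be checked with a few lines of arithmetic; the 64-to-128-channel, 3 × 3 example is illustrative and not taken from the paper.

```python
def conv_params(n_in, n_out, k):
    """Parameters of a standard k x k convolution layer (biases ignored)."""
    return n_in * n_out * k * k

def dsconv_params(n_in, n_out, k):
    """Depthwise (n_in kernels of k x k) plus pointwise (1 x 1) parameters."""
    return n_in * k * k + n_in * n_out

# Illustrative example: 64 input channels, 128 output channels, 3 x 3 kernels
std = conv_params(64, 128, 3)    # N*M*K*K
sep = dsconv_params(64, 128, 3)  # N*K*K + M*N
ratio = std / sep                # roughly 8.4x fewer parameters here
```

The savings grow with the channel counts, which is why the technique suits the deeper, wider layers of a network most.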
The residual structure was proposed to solve the problem of exploding or vanishing gradients in neural network training. Its core idea is to build the network from residual blocks, each of which creates a shortcut connection between its input and output, allowing information to pass more smoothly through the network. Specifically, each residual block contains multiple convolutional layers used to extract and transform features, while the shortcut connection lets the input signal skip one or more layers and be added directly to the output of subsequent layers. This structure ensures that gradients can flow back to earlier layers more effectively during backpropagation, avoiding the vanishing gradient problem. Mathematically, the shortcut connection can be expressed as y = F(x) + x, where y is the output of the current block, F(x) is the feature map obtained through operations such as convolution, and x is the output of the preceding layer or layers, that is, the input to the shortcut connection. The addition not only preserves the information of the original input but also lets the network focus on learning the residual between input and output, that is, the difference between them. This residual learning helps the network extract features more efficiently and makes the training of deep networks more stable. Because the gradient of y contains a derivative term with respect to the input x, gradients propagate effectively even in deep networks, alleviating the vanishing gradient problem. In addition, since the convolutional layers in a residual block are usually accompanied by batch normalization and ReLU activation functions, the stability and convergence speed of the network improve further.
Figure 4 shows a schematic diagram of a three-layer residual structure.
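The identity y = F(x) + x can be illustrated with a toy fully connected residual block in NumPy. Real residual blocks, including those in the LW-CNN, use convolutions with batch normalization; this dense version is only a sketch of the shortcut mechanism.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2):
    """Toy dense residual block: y = ReLU(F(x) + x), where
    F(x) = ReLU(x @ w1) @ w2 stands in for the convolutional branch."""
    f = relu(x @ w1) @ w2  # the residual branch F(x)
    return relu(f + x)     # shortcut connection adds the input back

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))
w1 = rng.normal(scale=0.1, size=(8, 8))
w2 = rng.normal(scale=0.1, size=(8, 8))
y = residual_block(x, w1, w2)

# If the branch weights are all zero, F(x) = 0 and the block reduces
# to the identity mapping (up to the final ReLU)
y_id = residual_block(x, np.zeros((8, 8)), np.zeros((8, 8)))
```

The zero-weight case shows why residual blocks are easy to optimize: the block only has to learn the deviation from the identity, not the full mapping.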
The lightweight convolutional neural network model, LW-CNN, employed in this study primarily comprises an initial convolutional layer, three residual blocks, a global average pooling layer, a Dropout layer, and a fully connected layer. In the initial convolutional layer, a 3 × 3 kernel performs convolution on the input image with a stride of 2; padding is applied at the image edges to keep the spatial dimensions consistent, transforming the three-channel input image into a 32-channel feature map. The three residual blocks expand the number of channels in the feature map to 64, then to 128, and then maintain it at 128, respectively. Each residual block contains two convolutional layers: the first employs a 3 × 3 kernel and the second a 1 × 1 kernel. The stride of the 3 × 3 convolutional layer is 1 in the first and third residual blocks and 2 in the second. Shortcut connections between the two branches of each residual block sum their outputs, effectively addressing the vanishing gradient problem and ensuring a smooth flow of information. A batch normalization layer and a ReLU activation function follow the initial convolutional layer and each convolutional layer in the residual blocks; the former stabilizes the training process and expedites model convergence, while the latter enhances the model’s representational capacity and improves classification accuracy. A global average pooling layer then reduces the spatial dimensions of the feature map to 1 × 1, yielding a feature vector. To further enhance the model’s generalization capability, a Dropout layer is introduced: during training it randomly discards the outputs of some neurons, preventing the model from over-relying on specific neurons and thereby improving robustness.
Finally, the feature vector is fed into a fully connected layer for classification to generate the final prediction. Experimental results show that setting the dropout rate to 0.5 achieves the best prediction performance. The schematic diagram of this lightweight convolutional neural network is depicted in Figure 5.
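As a sanity check on the architecture described above, the feature-map shapes can be traced layer by layer: with “same” padding, a stride-s convolution maps a spatial size h to ceil(h/s). The class count below is a hypothetical placeholder for the number of enrolled users.

```python
import math

def lwcnn_shapes(h=28, w=28, n_classes=10):
    """Trace feature-map shapes through the LW-CNN described above.
    n_classes is an assumed placeholder (number of enrolled users)."""
    def conv(h, w, stride):
        return math.ceil(h / stride), math.ceil(w / stride)
    shapes = {}
    h, w = conv(h, w, 2)                  # initial 3x3 conv, stride 2, 3 -> 32 ch
    shapes["initial conv"] = (h, w, 32)
    for name, ch, s in [("block1", 64, 1), ("block2", 128, 2), ("block3", 128, 1)]:
        h, w = conv(h, w, s)              # 3x3 conv (stride s) then 1x1 conv
        shapes[name] = (h, w, ch)
    shapes["global avg pool"] = (1, 1, 128)
    shapes["fc"] = (n_classes,)
    return shapes

shapes = lwcnn_shapes()
```

Starting from the 28 × 28 MTF image, the two stride-2 layers reduce the spatial size to 7 × 7 before global average pooling collapses it to a 128-dimensional feature vector.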