3.2.1. Preprocessing
Preprocessing the photoplethysmography (PPG) signals is an essential step in analyzing and extracting heart biometric information.
Figure 2 illustrates the preprocessing of PPG signals, and the preprocessing steps are presented in Algorithm 1. PPG signals capture volumetric changes in blood vessels, offering insightful data about the cardiac cycle, heart rate, and various cardiovascular parameters. In preprocessing these signals for authentication systems, overlapping segments are typically employed to ensure no vital transitional information is missed between segments. A common practice is to use an overlap of 50%; for example, with 5 s segments, the subsequent segment starts 2.5 s into the current one. This overlapping technique is essential for capturing more robust features in the PPG signal and is particularly beneficial for detecting short-lived but significant events or artifacts.
First step: when monitoring a person's PPG signal to extract information about their heart rate and cardiovascular health, we must first preprocess the PPG signals. Initially, a raw PPG signal is captured that contains noise from various sources, including ambient light fluctuations and electrical interference. To clean up the signal, we applied a bandpass filter: its low-pass component removes high-frequency noise (such as rapid fluctuations caused by ambient light), and its high-pass component eliminates baseline wander (slow drifts caused by motion or pressure changes). The resulting signal is smoother and focused on the pulsatile component, making it easier to analyze. Despite filtering, the PPG signal might still contain some residual baseline drift due to small movements or changes in pressure, so we subtracted this baseline from the signal. Subtracting the baseline isolates the variations caused by pulsatile blood flow, minimizes interference from non-pulsatile components, centers the data around zero, and supports accurate heart rate extraction and precise analysis of cardiovascular conditions. During the recording, the person might move, leading to motion artifacts in the PPG signal. To address this, we applied principal component analysis (PCA) as a motion artifact removal technique. The idea is that the principal component(s) capturing the most variance in the signal are likely to represent the true PPG signal, while the components capturing less variance may represent noise or artifacts. In practice, PCA is a linear method well suited to motion artifact removal in PPG signals because of its simplicity, computational efficiency, effectiveness in isolating vital signals, and capacity for noise and dimensionality reduction, making it broadly applicable.
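A minimal sketch of this preprocessing chain (band-pass filtering, residual baseline subtraction, and PCA-based artifact reduction) is shown below; the cut-off frequencies, the moving-average window, and the segment-stacking strategy for PCA are illustrative assumptions rather than the exact implementation used here.

```python
import numpy as np
from scipy.signal import butter, filtfilt
from sklearn.decomposition import PCA

def bandpass_filter(ppg, fs, low=0.5, high=8.0, order=4):
    """Remove baseline wander (< low Hz) and high-frequency noise (> high Hz)."""
    b, a = butter(order, [low, high], btype="bandpass", fs=fs)
    return filtfilt(b, a, ppg)

def remove_residual_baseline(ppg, fs, window_s=2.0):
    """Subtract a moving-average baseline so the signal is centred around zero."""
    win = max(1, int(window_s * fs))
    baseline = np.convolve(ppg, np.ones(win) / win, mode="same")
    return ppg - baseline

def pca_artifact_reduction(segments, n_components=1):
    """Keep only the dominant principal component(s) of a stack of segments,
    assuming they carry the pulsatile PPG while minor components carry artifacts."""
    segments = np.asarray(segments)        # shape: (n_segments, segment_length)
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(segments)  # project onto the main component(s)
    return pca.inverse_transform(reduced)  # reconstruct the denoised segments
```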
To ensure consistency when comparing PPG signals across different individuals or sessions, we implemented a normalization process. This process involves scaling the signal’s amplitude, but rather than simply dividing the samples by the value of the highest amplitude peak, we chose a more tailored approach. We adjusted the signal so that its peak amplitude aligns with a specific physiological parameter, like blood volume change. This method of normalization, as opposed to restricting values to the [0, 1] interval, allows for a more physiologically relevant comparison of signals. It ensures that, despite varying signal amplitudes from different sources, the comparison focuses on meaningful physiological variations rather than just numerical values. As mentioned earlier in
Section 3.1, we have collected PPG signals from different sources, such as RW-PPG, BIDMC, and MIMIC. Therefore, it is very important to apply this normalization step to the collected datasets.
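As a hedged illustration of this dataset-spanning normalization, the sketch below rescales each zero-centred segment so its peak amplitude matches a common reference value; the reference amplitude is a placeholder for the physiological scaling target described above, not the exact quantity used in our pipeline.

```python
import numpy as np

def normalize_amplitude(segment, reference_amplitude=1.0):
    """Rescale a zero-centred PPG segment so its peak amplitude matches a common
    reference value, making segments from RW-PPG, BIDMC, and MIMIC comparable."""
    peak = np.max(np.abs(segment))
    return segment if peak == 0 else segment * (reference_amplitude / peak)
```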
Given the variability in sampling rates of PPG signals, which can differ based on the recording equipment or conditions, we standardized these signals by resampling them at a uniform rate. For instance, since the original signals had sampling rates of 250 Hz and 500 Hz, we resampled all signals to a common rate of 250 Hz using a bicubic interpolation method. This resampling step is vital for ensuring accurate feature extraction and consistent analysis across all signals. After this preprocessing step, we assess the quality of the PPG signal, ensuring that the data used in our analysis maintain a high standard of consistency and reliability.
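The resampling step might be sketched as follows, using cubic interpolation as the one-dimensional counterpart of bicubic image interpolation; the function name and extrapolation behaviour are assumptions of the sketch.

```python
import numpy as np
from scipy.interpolate import interp1d

def resample_to(signal, fs_in, fs_out=250):
    """Resample a 1-D PPG signal from fs_in to a common fs_out using cubic interpolation."""
    duration = len(signal) / fs_in
    t_in = np.linspace(0.0, duration, num=len(signal), endpoint=False)
    t_out = np.linspace(0.0, duration, num=int(round(duration * fs_out)), endpoint=False)
    return interp1d(t_in, signal, kind="cubic", fill_value="extrapolate")(t_out)

# Example: bring a 500 Hz recording down to the common 250 Hz rate
# ppg_250 = resample_to(ppg_500, fs_in=500, fs_out=250)
```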
The PPG signal quality is evaluated by identifying segments where the signal quality might be compromised. We flagged instances where the signal-to-noise ratio (SNR) is low, indicating that the desired signal is obscured or contaminated by noise. The preprocessed PPG signals are thus scrutinized for segments in which the reliability or clarity of the signal might be compromised by external factors or inherent noise, ensuring that subsequent analyses and feature extraction are based on high-quality, reliable data. These assessments involve analyzing the consistency, stability, and clarity of the signal waveform across the recorded data.
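One possible form of this SNR-based screening is sketched below; the cardiac band limits and the acceptance threshold are illustrative assumptions, not the thresholds used in our experiments.

```python
import numpy as np
from scipy.signal import welch

def segment_snr_db(segment, fs, cardiac_band=(0.7, 3.0)):
    """Estimate SNR as the ratio of in-band (cardiac) power to out-of-band power."""
    freqs, psd = welch(segment, fs=fs, nperseg=min(len(segment), int(4 * fs)))
    in_band = (freqs >= cardiac_band[0]) & (freqs <= cardiac_band[1])
    signal_power = psd[in_band].sum()
    noise_power = psd[~in_band].sum() + 1e-12   # avoid division by zero
    return 10.0 * np.log10(signal_power / noise_power)

def is_acceptable(segment, fs, threshold_db=5.0):
    """Flag a segment as reliable only if its estimated SNR exceeds the threshold."""
    return segment_snr_db(segment, fs) >= threshold_db
```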
Algorithm 1: Preprocessing PPG Signals to Remove Noise and Artifacts
3.2.2. Signal Transformation
Converting preprocessed PPG signals into 2-D images is a technique commonly used for further analysis. The overall steps are described in Algorithm 2, which transforms multiple segments of preprocessed PPG signals into 2-D images known as scalograms. PPG signals are used to monitor various cardiovascular parameters, and scalograms help visualize their frequency components over time.
The Continuous Wavelet Transform (CWT) is a signal processing technique that differs from the Short-Time Fourier Transform (STFT) by employing a dynamic window derived from a prototype function known as the mother wavelet. This wavelet is both scaled and shifted during the transformation process, allowing for long time intervals at low frequencies and short time intervals at high frequencies. Unlike the STFT, where window sizes remain constant, the CWT can adapt by using windows of varying sizes, enabling it to effectively analyze both high- and low-frequency components in a time series [34]. This adaptability makes CWT particularly useful for analyzing non-stationary signals like EEG. The method utilizes smaller scales for high frequencies and larger scales for low frequencies to achieve optimal resolution. In practice, the choice between CWT and STFT depends on the specific characteristics of the signal and the analysis goals. CWT is advantageous when high adaptability, precise frequency localization, and detailed time-frequency information are required. On the other hand, STFT is computationally more efficient and may be suitable for simpler cases where fine-grained time-frequency analysis is not critical.
Mathematically, the continuous-time representation of the CWT is defined by Equation (1):
$$W(s,\tau)=\frac{1}{\sqrt{|s|}}\int_{-\infty}^{+\infty}x(t)\,\psi^{*}\!\left(\frac{t-\tau}{s}\right)dt \tag{1}$$
where:
W(s, τ): the wavelet coefficients. They represent the result of the CWT applied to a time signal x(t) at different scales (s) and positions (τ).
x(t): the time-domain signal to be analyzed using CWT. It could be any non-stationary signal, such as an EEG signal.
ψ(t): the mother wavelet function, whose complex conjugate ψ*(t) appears in the transform. The choice of wavelet is crucial in CWT; we used the Morlet wavelet, which is known for its suitability in spectral analysis of non-stationary signals and is often preferred for its balance between time and frequency localization.
s: the scale parameter. It controls the width of the wavelet function in the time-frequency domain: a scale factor greater than 1 stretches the wavelet to capture low-frequency components of the signal, while a scale factor less than 1 compresses it to capture high-frequency components (see the brief example after this list).
τ: the position parameter, often referred to as time or shift. It determines where the wavelet function is centered in the time domain. Shifting the wavelet across the signal to different positions τ allows analyzing the signal's time evolution at various points.
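As a brief illustration of the scale parameter, the snippet below uses PyWavelets to convert Morlet scales into approximate pseudo-frequencies at a 250 Hz sampling rate; the chosen scales are arbitrary examples.

```python
import pywt

fs = 250  # sampling rate in Hz
for scale in (4, 16, 64, 256):
    # scale2frequency returns cycles/sample; multiply by fs to obtain Hz
    freq_hz = pywt.scale2frequency("morl", scale) * fs
    print(f"scale {scale:>3} -> ~{freq_hz:.2f} Hz")  # larger scales -> lower frequencies
```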
The CWT is utilized to analyze the PPG data and produce a visualization known as a scalogram. The PPG data undergo the CWT process, in which they are analyzed across various frequency components by applying multiple dilations and time offsets of a wavelet, specifically the Morlet continuous wavelet in this context. The CWT allows the identification and visualization of local time-frequency energy density within the signal, depicted as a scalogram. Each segment of the PPG data is independently transformed through CWT, generating one scalogram image per segment. In this paper, a total of 500 images were generated, 100 for each of the segments, indicating that the data are divided into distinct portions for analysis. These scalogram images act as visual representations of the frequency components present within each respective PPG signal segment, providing insight into the temporal and frequency characteristics of the blood volume changes during the cardiac cycles represented in each segment. Lastly, sample scalogram images resulting from the CWT transformation are displayed for three different individuals, highlighting variations and unique features within their respective PPG signals.
We used the Morlet wavelet as the CWT basis function, a fundamental step in developing the scalogram-based images. The algorithm expects a list of preprocessed PPG signal segments, each representing a different time interval of PPG data and consisting of signal amplitudes sampled at discrete time points. The algorithm proceeds by initializing an empty 2-D NumPy array called “image_matrix”. This array stores the scalogram information for each PPG segment, with each row corresponding to a different segment and columns representing either time or frequency bins, depending on the scalogram representation.
For each PPG segment in the input list, the algorithm performs the following steps: (1) Calculates the number of data points in the segment from the specified segment_duration and sampling_rate. (2) Pads the PPG segment with zeros if its length is shorter than the desired length, to ensure uniformity. (3) Applies the Continuous Wavelet Transform (CWT) to compute the scalogram of the PPG segment; the chosen wavelet type, here the Morlet wavelet, determines the transformation. (4) Focuses on the magnitude of the frequency components by taking the absolute value of the scalogram. (5) Resizes the scalogram to match the desired image length specified by the desired_length parameter. The algorithm thus generates a 2-D image (a scalogram) that visually represents the frequency content of multiple PPG signal segments, enabling the analysis of time-varying frequency components in the PPG data. Users can modify the parameters and, for example, replace the PPG segments with their own data for specific applications.
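A minimal sketch of these steps, assuming PyWavelets for the Morlet CWT and using an illustrative 128-scale range and 224-pixel output width, is given below.

```python
import numpy as np
import pywt
from scipy.ndimage import zoom

def segment_to_scalogram(segment, fs=250, segment_duration=5,
                         scales=np.arange(1, 129), desired_length=224):
    n_points = int(segment_duration * fs)               # (1) expected number of samples
    if len(segment) < n_points:                         # (2) zero-pad short segments
        segment = np.pad(segment, (0, n_points - len(segment)))
    coeffs, _ = pywt.cwt(segment, scales, "morl",       # (3) Morlet CWT
                         sampling_period=1.0 / fs)
    scalogram = np.abs(coeffs)                          # (4) keep the magnitude only
    factor = desired_length / scalogram.shape[1]        # (5) resize the time axis
    return zoom(scalogram, (1.0, factor), order=1)

# One scalogram per preprocessed segment, stacked into image_matrix
# image_matrix = np.stack([segment_to_scalogram(s) for s in ppg_segments])
```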
Regarding the frequency scale of the scalograms ranging from 0 to 100 Hz, as shown in
Figure 3, it is important to clarify that while the standard heart rate range typically lies between 0.7 and 1.8 Hz, the wider frequency range in the scalograms is employed to capture not only the fundamental heart rate frequencies but also higher-frequency harmonics and transient events. These additional frequency components can provide valuable biometric information, such as subtle variations in heart rate patterns, which might be indicative of specific physiological or pathological conditions. The inclusion of a broader frequency range allows for a more comprehensive analysis of the PPG signal, which is crucial for the enhanced accuracy and reliability of our biometric authentication system.
Algorithm 2: Generating a 2-D Image (Scalogram) from Preprocessed PPG Signals
3.2.3. Model Architecture–CVT-ConvMixer
This algorithm outlines the step-by-step process of building a hybrid model that combines the strengths of the Convolutional Vision Transformer and ConvMixer architectures. The resulting hybrid model is aimed at effectively classifying PPG signal feature vectors, ultimately contributing to biometric authentication efforts.
The purpose of this algorithm is to develop and train a hybrid model, combining the Convolutional Vision Transformer (CVT) and ConvMixer architectures, for classifying PPG signal feature vectors in the context of biometric authentication. The algorithm takes PPG signal feature vectors for training and testing, along with corresponding labels. First, the algorithm applies a preprocessing function to each PPG signal feature vector in the training dataset. This step may involve removing artifacts, normalizing data, and preparing it for further processing. The preprocessed signals are segmented into fixed-length segments. This segmentation process ensures that the signals are divided into smaller, manageable portions that can be fed into the model. To feed the segmented data into the model, the algorithm creates 2-D image maps from the segments. This involves reshaping the segments into a suitable image-like format. Zero-padding may also be applied to ensure consistent dimensions.
Convolutional Vision Transformer (CVT): The algorithm proceeds with the CVT component. It applies a Conv2D layer to generate convolutional embeddings from the image maps. The embeddings are then reshaped, and positional information is added to capture spatial relationships. Multi-head self-attention is applied using the MultiHeadAttention layer with parameters such as the number of heads and key dimensions. The attention outputs are aggregated to obtain CVT features [
35], capturing relevant spatial information.
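An illustrative tf.keras sketch of this CVT branch is given below; the patch size, embedding width, head count, and the fixed sinusoidal positional encoding (standing in for the learned positional information) are assumptions of the sketch, not the exact configuration of our model.

```python
import numpy as np
from tensorflow.keras import layers

def cvt_branch(image_input, embed_dim=64, patch_size=16, num_heads=4):
    # Convolutional embedding: one token per non-overlapping patch
    x = layers.Conv2D(embed_dim, kernel_size=patch_size, strides=patch_size)(image_input)
    num_patches = x.shape[1] * x.shape[2]
    tokens = layers.Reshape((num_patches, embed_dim))(x)

    # Fixed sinusoidal positional encoding (a stand-in for learned position embeddings)
    pos = np.arange(num_patches)[:, None]
    dim = np.arange(embed_dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (dim // 2)) / embed_dim)
    pos_encoding = np.where(dim % 2 == 0, np.sin(angles), np.cos(angles)).astype("float32")
    tokens = tokens + pos_encoding

    # Multi-head self-attention over the token sequence, aggregated into CVT features
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)(tokens, tokens)
    return layers.GlobalAveragePooling1D()(attn)
```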
In our research, we introduce a novel concept called the Convolutional Transformer Block, which incorporates convolutional layers into the Transformer structure. We believe that strategically adding convolutions to the ViT (Vision Transformer) can enhance its performance and robustness while still maintaining computational and memory efficiency. To validate our idea, we propose the Convolutional Vision Transformer (CVT), which seamlessly integrates convolutions into the Transformer framework, ensuring efficiency in terms of parameters and floating-point operations (FLOPs). The CVT model comprises two key components: the Vision Transformer (VT) and Feature Learning (FLs). The FL component is responsible for extracting useful features from the Continuous Wavelet Transform-based Spectrogram (CWTS) images. These learned features are then passed to the VT component, which further transforms them into a series of image pixels for the final detection phase. Unlike the Inception v3 model, the FL component does not include a fully connected layer and focuses on extracting CWTS features rather than classification, effectively acting as a CNN (Convolutional Neural Network) without its fully connected layer.
Within the FLs component, there are 17 convolutional layers, each using a 3 × 3 kernel size, employing ReLU activation for non-linearity, and incorporating batch normalization (BN) for normalizing output features. Max-pooling is applied five times with a 2 × 2-pixel window and a stride of 2, reducing the image dimensions after each pooling operation.
The VT component of the CVT model takes the feature map of the CWTS spectrogram as input, creating seven patches from the feature maps. These patches are then linearly embedded into a sequence of length 1 × 1024, and their positional information is retained by adding position embeddings (2 × 1024 dimensions). These embeddings are then sent to the Transformer.
The VT encoder in the Vision Transformer employs Multi-Head Self-Attention (MSA) and Multi-Layer Perceptron (MLP) blocks. The MLP block serves as a Feedforward Network (FFN), and the Transformer’s internal layers are normalized. The Transformer uses eight attention heads, and the MLP head consists of ReLU nonlinearity followed by two linear layers, akin to a fully connected layer in a typical CNN architecture. The first layer has 2048 channels, while the last layer has two channels. The CVT model consists of 38.6 million trainable parameters and a total of 20 weighted layers. For the final detection task, Softmax is applied to the MLP head’s output to obtain class probabilities.
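The sketch below shows one such encoder block (MSA followed by an MLP, with layer normalization and residual connections), using the eight heads and the 2048-unit feedforward width mentioned above; the residual/normalization placement follows the standard ViT encoder and is an assumption of the sketch.

```python
from tensorflow.keras import layers

def vt_encoder_block(tokens, embed_dim=1024, num_heads=8, mlp_units=2048):
    # Multi-Head Self-Attention sub-block with residual connection
    x = layers.LayerNormalization()(tokens)
    x = layers.MultiHeadAttention(num_heads=num_heads,
                                  key_dim=embed_dim // num_heads)(x, x)
    attn_out = layers.Add()([tokens, x])

    # MLP (feedforward) sub-block with ReLU nonlinearity and residual connection
    y = layers.LayerNormalization()(attn_out)
    y = layers.Dense(mlp_units, activation="relu")(y)
    y = layers.Dense(embed_dim)(y)
    return layers.Add()([attn_out, y])
```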
ConvMixer Model: Next, the ConvMixer component is applied. Similar to CVT, the algorithm utilizes a Conv2D layer to create ConvMixer embeddings. Multi-head self-attention is then applied, followed by the aggregation of attention outputs to derive ConvMixer features. This captures relevant spatial information similar to CVT but with a different approach. This patch-based processing is a fundamental aspect of ConvMixer’s design [
36], enabling the model to capture fine-grained features and achieve impressive performance on various tasks without the need for complex recurrent or attention mechanisms.
The ConvMixer component is a novel architectural approach to deep learning, particularly suited for tasks requiring spatial understanding and feature extraction from images. It introduces a unique combination of Conv2D layers and multi-head self-attention mechanisms. Similar to traditional Convolutional Neural Networks (CNNs), ConvMixer uses Conv2D layers as its foundational building blocks. These layers perform convolution operations on the input data, applying learnable filters to extract local features. Conv2D layers have been widely successful in image processing tasks, allowing models to capture hierarchical features from low-level edges to high-level object representations.
To develop the CVT-ConvMixer model, we address the challenge of capturing long-term dependencies in low-frequency heart rate data by transforming this data into a 2D spatial representation, such as scalograms. This transformation enables the model to interpret heart rate signals not just over time but also across various frequency components, facilitating the detection of extended dependencies. Additionally, the integration of self-attention mechanisms within the model significantly enhances its capability to identify and prioritize the most informative segments of the data. These self-attention layers enable the model to focus on crucial parts of the signal, regardless of their position in the sequence. Consequently, this innovative combination of 2D signal transformation and self-attention allows our model to effectively recognize and utilize long-term patterns and dependencies, crucial for robust biometric authentication based on heart rate data.
After the initial Conv2D layers, ConvMixer incorporates multi-head self-attention mechanisms. Self-attention enables the model to weigh the importance of different spatial positions when processing information. We selected task-specific criteria for the ConvMixer model, so it learns to prioritize or weight attention toward elements that are crucial for predictive accuracy, based on correlations learned from the training data. Multi-head attention refers to running multiple parallel attention mechanisms, each focusing on different parts of the input. This allows the model to capture long-range dependencies and relationships between various spatial positions in the image. The outputs of the multiple attention heads are then aggregated to form ConvMixer features. This aggregation step combines the information learned by the different heads, effectively capturing a holistic understanding of the spatial relationships within the input data. This process is crucial for enabling ConvMixer to handle complex visual patterns and relationships effectively.
One of the distinctive features of ConvMixer is its reliance on patch-based processing. Instead of using complex recurrent or attention mechanisms to model spatial relationships, ConvMixer divides the input image into smaller patches and processes each patch independently. This patch-based approach allows ConvMixer to capture fine-grained details and spatial relationships without the computational complexity associated with sequential or attention-based processing. It also enables ConvMixer to achieve remarkable performance on a wide range of computer vision tasks.
ConvMixer is a powerful architectural innovation that combines Conv2D layers and multi-head self-attention to capture spatial information effectively. Its patch-based processing approach simplifies the modeling of spatial relationships, making it efficient and suitable for various computer vision tasks. The initial layer of ConvMixer applies these principles to start the feature extraction process. Therefore, the first layer of ConvMixer, a patch-embedding convolution with patch size p and h output channels followed by the activation σ and batch normalization, is calculated by Equation (2) as:
$$z_{0}=\mathrm{BN}\left(\sigma\left\{\mathrm{Conv}_{c\rightarrow h}\left(X;\ \mathrm{stride}=p,\ \mathrm{kernel\_size}=p\right)\right\}\right) \tag{2}$$
The model's second component is the primary ConvMixer layer, which is replicated for a certain depth. Within this layer, there is a residual block incorporating a depthwise convolution. In essence, a residual block combines the output of a prior layer with the output of a subsequent layer; in this specific case, the inputs are fused with the outcome of the depthwise convolution layer. Following this fusion, there is an activation block, then a pointwise convolution, and another subsequent activation block, calculated by Equations (3) and (4) as:
$$z_{l}^{\prime}=\mathrm{BN}\left(\sigma\left\{\mathrm{ConvDepthwise}\left(z_{l-1}\right)\right\}\right)+z_{l-1} \tag{3}$$
$$z_{l}=\mathrm{BN}\left(\sigma\left\{\mathrm{ConvPointwise}\left(z_{l}^{\prime}\right)\right\}\right) \tag{4}$$
The third component of the ConvMixer model introduces a crucial step: the inclusion of a global pooling layer. This step derives a feature vector of size ‘h’ from the processed patches. Global pooling compresses the spatial dimensions of each patch into a fixed size, a pivotal transformation for subsequent tasks such as classification using a softmax classifier. For activation functions, ConvMixer adopts the Gaussian error linear unit (GELU). GELU is a smooth, differentiable activation function renowned for its strong performance in deep neural networks. Unlike the Rectified Linear Unit (ReLU), which simply sets all negative values to zero, GELU applies a more nuanced approach: it weights inputs by their magnitude rather than relying solely on their sign, offering a softer gating mechanism. This characteristic of GELU preserves both positive and negative information within the activation, making it particularly well suited to models like ConvMixer, where a nuanced understanding of features is critical. The GELU function is calculated by Equation (5) as:
$$\mathrm{GELU}(x)=x\,\Phi(x)\approx 0.5\,x\left(1+\tanh\!\left[\sqrt{2/\pi}\left(x+0.044715\,x^{3}\right)\right]\right) \tag{5}$$
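A hedged tf.keras sketch of these ConvMixer computations (patch embedding, depthwise convolution inside a residual block, pointwise convolution, GELU with batch normalization, and global pooling) is shown below; the width h, depth, patch size, and kernel size are illustrative choices, not the tuned values of our model.

```python
from tensorflow.keras import layers

def conv_mixer(image_input, h=256, depth=8, patch_size=7, kernel_size=9):
    # Equation (2): patch embedding, z0 = BN(GELU(Conv(X, stride=p, kernel_size=p)))
    x = layers.Conv2D(h, kernel_size=patch_size, strides=patch_size)(image_input)
    x = layers.Activation("gelu")(x)
    x = layers.BatchNormalization()(x)

    for _ in range(depth):
        # Equation (3): depthwise convolution inside a residual block
        residual = x
        x = layers.DepthwiseConv2D(kernel_size=kernel_size, padding="same")(x)
        x = layers.Activation("gelu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Add()([x, residual])
        # Equation (4): pointwise (1 x 1) convolution
        x = layers.Conv2D(h, kernel_size=1)(x)
        x = layers.Activation("gelu")(x)
        x = layers.BatchNormalization()(x)

    # Third component: global pooling to obtain a feature vector of size h
    return layers.GlobalAveragePooling2D()(x)
```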
Features Fusion: Both CVT and ConvMixer features are fused using a concatenation or combination process. The algorithm utilizes a Flattening layer to combine these features, resulting in a comprehensive representation that benefits from both architectures. As shown in
Figure 4 and described in Algorithm 3, our attentional selective fusion (ATTSF) consists of global attention and local attention, which can add more flexibility when fusing various forms of information, and is calculated by Equation (6) as:
Final Classification Layer: A Dense layer is applied to the combined features to produce the final classification output. This layer employs a softmax activation function, enabling the model to output class probabilities for the input signals. Creating the Hybrid Model: The hybrid model is assembled by defining the input layer (representing the PPG signal feature vectors) and the final classification output. The entire architecture, comprising the CVT, ConvMixer, and fusion components, is encapsulated in the hybrid_model.
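The assembly of the hybrid model might look like the sketch below, where cvt_branch and conv_mixer refer to the earlier sketches; the input shape and the number of classes are placeholders rather than the values used in our experiments.

```python
from tensorflow.keras import Model, layers

def build_hybrid_model(num_classes, input_shape=(224, 224, 1)):
    inputs = layers.Input(shape=input_shape)      # scalogram image maps
    cvt_features = cvt_branch(inputs)             # CVT branch (see earlier sketch)
    mixer_features = conv_mixer(inputs)           # ConvMixer branch (see earlier sketch)
    fused = layers.Concatenate()([cvt_features, mixer_features])
    fused = layers.Flatten()(fused)               # flattening layer used for fusion
    outputs = layers.Dense(num_classes, activation="softmax")(fused)
    return Model(inputs=inputs, outputs=outputs, name="hybrid_cvt_convmixer")
```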
Algorithm 3: A Hybrid CVT-ConvMixer Model with Self-Attention Mechanisms