1. Introduction
In modern society, the identification procedure in security systems is of critical importance and is gaining increasing attention. In the past, the most widely used security tools were based on passwords (something uniquely known only to the user) or tokens (proving that the user possesses something unique that identifies him/her). These methods, however, proved to have a number of disadvantages: such security tools can be lost, stolen, discovered, or copied. Therefore, in order to prevent fraud and cybercrime, researchers around the world have devised innovative user recognition approaches based on more secure identification systems [
1].
Biometrics represents the perfect solution as it focuses on the intrinsic characteristics of the person, requiring his/her physical presence and minimizing the likelihood of a successful intrusion. Biometrics can be based on the physiological (fingerprint, iris, retina, heart sound, and ECG) [
2,
3,
4] or behavioral aspects of the person (voice, face, gait, etc.) [
5,
6]. Often, a biometric system aims to identify or authenticate a person based on the measurement of one or more biometric traits [
7,
8].
Many human traits have been proposed and studied for identity recognition purposes, such as fingerprints, face, iris, and voice.
Fingerprinting-based systems [
9] can be considered the predecessor of automatic human identification approaches. The Automated Fingerprint Identification System (AFIS) was first introduced in 1980 and enabled quick and efficient matching of fingerprints against the available data in large databases. The integration of fingerprint sensors into consumer electronics was introduced in the early 2000s, beginning with the introduction of fingerprint-based door locks and followed by integration into notebooks and smartphones. In the 21st century, fingerprints have become a key component of the biometric systems used in various industries, from banking to border control. Governments around the world have implemented biometric passports and national ID systems, utilizing fingerprints as a core element for secure and reliable identity verification. The fingerprinting approach has been successfully applied to other recognition systems, considering the meaning of “fingerprint” in a much broader sense than just “an impression left by the friction ridges of a human finger”. An audio fingerprint, for example, is a concise and distinct representation of an audio file or stream, enabling the identification of a song in any format without relying on embedded watermarks or metadata [
10,
11,
12,
13,
14]. However, fingerprinting-based identification systems have some drawbacks: for example, the anti-cancer drug capecitabine can lead to the loss of fingerprints. Similarly, swelling of the fingers, such as that resulting from bee stings, may cause fingerprints to temporarily disappear. As skin elasticity decreases with age, many elderly individuals have fingerprints that are more difficult to capture. The ridges become thicker, and the difference in height between the top of the ridge and the bottom of the furrow diminishes, reducing their prominence. Fingerprints can also be permanently removed, a tactic that criminals might use to lower their chances of being identified. This erasure can be accomplished by various means, including burning the fingertips, applying acids, or resorting to advanced techniques like plastic surgery.
Thanks to technological advancements in the design of digital cameras and digital image processing algorithms, automatic face recognition methods have become quite common. However, the differences between individuals are often subtle: facial structures and the shapes of facial features tend to be quite similar. These characteristics alone are not sufficient for reliably distinguishing people based solely on their faces. Furthermore, the shape of the face is highly variable, influenced by factors such as expression, viewing angle, age, and lighting. In summary, while face recognition has high usability value, it suffers from low uniqueness and matching precision. Moreover, face images and videos are very easy to access. There is often no need to steal a user’s photo. Attackers can easily obtain the data they need from the internet, particularly through social networks. With those images and videos, it can be relatively simple to deceive an automatic face recognition system [
1,
15,
16].
Iris recognition can be performed by means of digital cameras too. It presents lower universality than a face recognition system because a certain number of users may have visual impairments. The strengths are uniqueness and permanence. Attackers can use high-resolution cameras to steal an iris image and attack an iris-based recognition system. Finally, contact lenses, whether colored or patterned, can bypass an iris recognition system in several ways: altering the iris details, creating a false pattern, creating reflections or distortions, or generating a partial occlusion of the iris. Overall, the use of lenses can compromise the accuracy of iris recognition systems, making them vulnerable to bypassing or false acceptance [
1,
17,
18].
The human voice possesses sufficient variability between individuals and demonstrates a high level of consistency within each person. This means that the distinct characteristics, such as pitch, tone, accent, and speech patterns, vary enough across different users to enable reliable identification. At the same time, these vocal traits remain stable for a given individual over time, enabling voice recognition systems to accurately authenticate users based on their unique vocal signatures. A simple microphone is required for voice data collection. However, since sound travels in all directions in an open environment, an attacker can record a user’s voice and replay it during the user authentication step. Accordingly, a voice-based authentication system can be deceived very simply [
1,
19,
20,
21].
More recently, a new set of biometric traits, called medical biometrics, have gained momentum. These types of systems are based on the individual unique features of the electrocardiogram (ECG) [
22,
23,
24,
25,
26,
27,
28] and phonocardiogram (PCG) [
29,
30,
31].
The above-mentioned signals represent a measure of the mechanical (PCG) and electrical (ECG) activity of the heart. The physiological nature of these two signals makes the biometric systems relying on them less vulnerable and more robust. Indeed, heart-based signals are known to be rather distinguishable across subjects (unique), hard to hide (measurable), and difficult to counterfeit (secure). To counterfeit heart biometric signals, an individual would essentially need to undergo significant surgical procedures to alter the structure of his/her body or, even more invasive, his/her heart (transplant). Additionally, the acquisition of such signals by third parties in a covert manner is nearly impossible without the active cooperation of the individual.
Considering the use of the PCG reading as a biometric signal, the authors in [
32,
33] paved the way for future research introducing the use of the sound produced by the heartbeat as a biometric system through the extraction of S1–S2 tones and the employment of various feature extraction techniques. Consequently, numerous studies were conducted by fellow researchers proposing innovative methods and algorithms for automatic biometric identification using PCG signals through feature extraction in the frequency and time domains, as well as machine learning and deep learning techniques [
34,
35,
36,
37,
38,
39].
The aim of the present study is to extend the initial research [
32,
33] focusing on the verification task by employing machine learning via a Multilayer Perceptron (MLP) network and further processing the cardiac sound signal. In particular, the study has set the following objectives:
Define a new set of features to perform subject verification based on the Mel-Frequency Cepstral Coefficient (MFCC) differences obtained in intra-subject and inter-subject sets.
Train an MLP neural network using as the input the MFCC-based features to classify the authenticity of a declared subject.
Evaluate the performance of the proposed model to perform the verification task of a subject.
Implement a mechanism based on a “recurrence filter” to improve the performance of the system, analyzing the outputs of the model regarding a broader time window of the sound signal.
Evaluate the system performance in the presence of typical environmental noise (e.g., office and babble noise).
2. Related Work
In the field of biometrics based on heart sounds, some authors of this study have built upon the previous research, including the pioneering work conducted by Beritelli and Serrano [
32]. In [
32], the authors achieved a 5% rejection rate and a 2.2% false acceptance rate using the chirp z-transform (CZT) for feature extraction and Euclidean distance (ED) for classification. Subsequently, the methodology [
40] was improved by incorporating sub-band aggregation, achieving an equal error rate (EER) of 10% with 70 subjects. In [
33], they refined the system further by employing Mel-Frequency Cepstral Coefficients (MFCCs), reducing the EER to 9% with 50 participants. Further testing with 40 participants lowered the EER to 5%. In [
41], they continued to develop methods using GMM and spectral features, improving the performance with a 13.7% EER on a dataset of 165 people. Instead, in [
42], the authors compared statistical and non-statistical approaches, with GMM-based statistical methods yielding a 15.53% EER, while non-statistical methods had a higher 29.08% EER, tested on 147 subjects. In [
34], the authors applied an identification system using LFCC and FSR for feature extraction and GMM for classification, achieving an EER of 13.66% with a database of 206 individuals.
Abo El Zahad and colleagues [
43,
44,
45] made significant contributions to the field of biometric systems using various feature extraction techniques. Initially, they developed a human verification system utilizing Wavelet Packet Cepstral Coefficients (WPCCs) for feature extraction, alongside Linear Discriminant Analysis (LDA) and the Bayes rule for classification, achieving an identification accuracy of 91.05% and an equal error rate (EER) of 3.2%. Later, they introduced a more robust human identification system employing several feature extraction methods, including Mel-Frequency Cepstral Coefficients (MFCCs), Linear Frequency Cepstral Coefficients (LFCCs), WPCCs, and Non-Linear Frequency Cepstral Coefficients (NLFCCs), obtaining EERs of 2.88% and 2.13% regarding different subject groups. In 2016, they further refined their methodology for individual identification, again utilizing MFCC, LFCC, Modified MFCC (M-MFCC), and WPCC approaches, achieving an identification accuracy of 91.05% and EERs of 3.2% and 2.68% regarding different groups. In [
46], the authors developed a PCG recognition system using MFCCs and k-means for feature extraction, tested on sixteen subjects with 626 heart sounds for training and six subjects with 120 heart sounds for testing. DNN achieved the highest accuracy of 91.12%. In [
29], the authors presented a PCG biometric system using wavelet preprocessing and various matching methods (ED, GMM, FSR, and VQ). The work proposed in [
47] describes a PCG identification method using autoregressive modeling, wavelets for de-noising, Hilbert envelope for segmentation, and bagged decision trees for classification, achieving 86.7% accuracy with 50 subjects. The authors in [
47] worked on two datasets with 60 and 50 subjects using MRD-MRR for preprocessing, SEE and multi-scale wavelet transforms for feature extraction, and various classifiers (RF, SVM, ANN, and KNN). RF had the highest accuracy in time–frequency analysis, and SVM in time-scale analysis.
In [
39], the authors analyzed 80 heart sounds from 40 subjects, using IMFs for feature extraction, a logistic regression–HSMM model for heart sound segmentation, and the Fisher ratio for feature selection, achieving 96.08% accuracy. In [
48], Tagashira and Nakagawa proposed a biometric system that works with identical features used for the detection of abnormal heart sounds and calculated from temporal sound power. Specifically, the features were obtained by summing the power spectral components calculated from the time–frequency analysis of heart sounds. They used the Mahalanobis distance (MD) obtained by the Mahalanobis–Taguchi (MT) method to perform identification. The performance of the proposed system produced a 90–100% authentication rate for 10 research participants. In [
49], the authors designed a non-invasive, discreet, and accurate device named the Continuous Cardiac Biometric (CCB) Patch. The device integrates a microphone chip on a flexible printed circuit board (PCB). The battery is encapsulated with the entire board inside a bio-compatible silicone case. The captured heart sounds are transmitted to a mobile device connected via Bluetooth. Following a two-stage filtering process with bandpass and zero-phase filters, the signals are preprocessed and labeled to mark each S1 and S2 heart sound peak. These labeled signals are then input into a machine learning program using a Convolutional Neural Network to create a profile of each individual’s heart sound. The performance, measured as accuracy on a test with 10 participants, was 98.3%. A few years later [
50], they extended their work, obtaining an accuracy of 99.5% on a dataset consisting of 20 participants.
Recently, the authors of [
51] introduced HeartPrint, a passive authentication system that takes advantage of the unique binaural bone-conducted PCGs recorded using dual in-ear monitors (IEMs)—sensors commonly found in widely used earables like Active Noise Cancellation (ANC) earphones and hearing aids. In particular, this method exploits both the human heart position, slightly on the left side of the body, causing different propagation paths for PCG to travel towards both ears, and distinct body asymmetry patterns shown by different individuals. Overall, these distinct characteristics result in the generation of unique binaural bone-conducted PCGs for each individual. Three different features are extracted, related to heart motion (2 × 16 MFCCs), body conduction (2 × 16 LPCs), and body asymmetry (energy at the output of 2 × 16 “5 Hz bandwidth” filters between 0 and 160 Hz). The detection of the S1 and S2 positions is mandatory to extract these features. First of all, a signal window is identified 200 ms before and 500 ms after the detected S1; three sets are obtained after the segmentation of the signal window into 32 overlapped frames lasting 64 ms. An RGB image is obtained from each window by scaling each feature value set to integers between 0 and 255, where the three matrices correspond to red, green, and blue channels. These RGB images are then fed to a CNN-based user classifier, which consists of three 2D-convolutional layers, one flatten layer, one fully connected layer, and an output layer, using the Softmax function to distinguish between a legitimate user and a potential attacker.
3. Materials and Methods
3.1. Background on Phonocardiogram Biometrics
Biometrics can be based on physiological or behavioral parameters. Biometric systems based on behavioral attributes include, but are not limited to, systems that use voice, gait, and/or signature as the biometric key. Conversely, biometric systems based on physiological attributes use iris, face, and/or heart signals (ECG/PCG) as the biometric key.
The PCG signal is the result of the sounds produced by the heart, and phonocardiography is the process of recording the sound produced during the cardiac cycle. The PCG, therefore, falls under the physiological attributes, as it is unique to each individual.
A digital stethoscope must be used to record the sound and vibrations produced by the heartbeat. The heart sound is generated by the opening and closing of the heart valves and by the turbulence of the blood flow, and it carries various physiological information. All in all, the heart sound can be considered a complex, non-stationary, quasi-periodic signal with two basic tones: the first heart sound, S1, and the second heart sound, S2, which mark the beginning of the systolic and diastolic phases, respectively; together they form the cardiac cycle.
Several studies [
34,
36,
51] state that PCG signals are unique to each person and fulfil certain biometric requirements:
Universal: each and every person has a heart that produces PCG signals while beating;
Measurable: it is possible to record a PCG sequence using an electronic stethoscope;
Uniqueness: heart sound is unique for each person;
Vulnerability: the PCG signal is really difficult to falsify;
Usability: thanks to technological advances, wearable, lightweight, and minimally invasive devices have been created that facilitate the recording of PCG signals.
A biometric authentication system consists of several blocks (
Figure 1). Specifically, a biometric authentication system based on PCG acquires the human print (heart sounds) by means of a digital stethoscope positioned over the chest in proximity to the heart. The raw acquired signal is usually preprocessed to perform operations such as amplitude normalization and framing, which subdivides the sequences of audio samples into overlapped windows of specific size. Features (usually based on analysis in the frequency domain) are then extracted from each window. In the training phase, these features are collected and labeled according to the subject identity in a system database, with the aim of building the references for the successive authentication phases. Specific classifiers based on artificial intelligence or statistical models can be properly trained using these features as input (they usually tune their parameters according to the input features and the related identity labels). Other kinds of classifiers use the stored features only in the authentication phase to compute, for example, specific distance metrics. In any case, in the authentication phase, the classifier performs matching between the features currently acquired from the subject using the system and the features stored in the system database (possibly by means of the corresponding trained models). A post-processing step can be further inserted in order to obtain the decision by taking into account a greater quantity of signal than that used to obtain a single decision by the classifier. This kind of post-processing is usually implemented by means of filters acting on the output of the classifier (e.g., mean or median filters).
Biometric authentication can be performed in two modes: identification (1:N) [
39], where N is the number of subjects the system is able to identify, and verification (1:1) [
29] of a predefined number I of subjects.
Identification mode (
Figure 1a): the system acquires information on the unique features of a particular user; i.e., it acquires biometric information and searches the entire database for a match of the acquired information. After classification, the biometric system decides which user the input sample corresponds to.
Verification Mode (
Figure 1b): the classifier decides whether or not the acquired trait information on a particular individual belongs to the declared user. The system compares the acquired data with previously stored information on the same individual and authenticates the particular individual.
Regardless of the mode, the system must acquire at least one trusted imprint for each of the subjects to be identified or verified.
This study focuses on a PCG biometric authentication system implementing solely a verification mode.
3.2. PCG Dataset
As a preliminary step, it is fundamental to select an appropriate dataset of PCG recordings with which to train and test the proposed biometric verification system. The selected database is the HSCT-11 [
34], which has been thoroughly described and used in the state of the art.
The database contains heart sounds acquired from 206 people (157 male and 49 female). To the best of our knowledge, HSCT-11 is the dataset containing PCG recordings from the highest number of subjects [
30]; accordingly, it best suits our goal of human identity verification. It contains two PCG recordings for each subject, usually collected on the same day. The average length of each recording is 45 s, the minimum being 20 s and the maximum being 70 s.
Figure 2 shows examples of PCG recordings in the dataset for two different subjects (one female and one male). Regarding the images in the first row,
Figure 2a,b are related to the 1st and 2nd recordings of the female subject, respectively. Regarding the images in the second row,
Figure 2c,d are related to the 1st and 2nd recordings of the male subject, respectively.
With the same structure,
Figure 3 shows details of the PCG recordings in
Figure 2, extracted at the 20th second and lasting 2 s.
It is important to note that a chunk lasting 2 s always contains a double repetition of the S1 and S2 tones.
3.3. Audio Signal Preprocessing
The original PCG sequences have been segmented into non-overlapped sub-sequences, called “chunks”, of shorter, constant duration.
Specifically, an input window of 2 s is used, whereby all sequences are segmented into chunks of this duration. This subdivision leads to multiple PCG sub-sequences for each subject, for a total of 3967 chunks two seconds in duration.
The obtained two-second chunks are normalized in power: the power of all PCG chunks is calculated and an average is extrapolated, i.e., −27 dBFS. All audio sequences are normalized to this average power. In this way, all sequences with power higher than the predetermined average are attenuated; conversely, all sequences with lower power are amplified. This avoids variability in terms of audio signal power when comparing different recordings. Furthermore, the PCG chunks are sampled at a sampling rate of f_s Hz, resulting in a digital signal containing N = 2·f_s samples per chunk lasting 2 s.
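As an illustration of this preprocessing step, the following Python sketch shows one possible implementation of the chunking and power normalization described above. The function names, the assumption of full-scale amplitudes in [−1, 1], and the handling of the trailing partial chunk are ours and are not taken from the original implementation.

```python
import numpy as np

def split_into_chunks(x, fs, chunk_s=2.0):
    """Split a PCG recording into non-overlapping chunks of fixed duration."""
    n = int(chunk_s * fs)
    n_chunks = len(x) // n                      # the trailing partial chunk is dropped (assumption)
    return [x[i * n:(i + 1) * n] for i in range(n_chunks)]

def normalize_power_dbfs(chunk, target_dbfs=-27.0):
    """Scale a chunk (assumed in [-1, 1] full scale) to the target average power in dBFS."""
    power = np.mean(chunk ** 2)
    power_dbfs = 10.0 * np.log10(power + 1e-12)
    gain = 10.0 ** ((target_dbfs - power_dbfs) / 20.0)   # amplitude gain corresponding to the dB correction
    return chunk * gain
```

Normalizing every chunk to the −27 dBFS average reported above makes louder recordings attenuated and quieter ones amplified, as described in the text.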
3.4. Feature Extraction
After preprocessing, an MFCC parameter extraction step is performed on each chunk. The samples in each chunk are further subdivided into overlapping windows. Let W be the length in samples of each window and H be the length in samples of the hop. Given the global number N of samples in each chunk, we obtain a number R of windows, the last one padded with zeroes. The steps to compute the MFCCs on each window can be summarized as follows:
Take the discrete Fourier transform of the samples in the window as follows:
X_r(k) = Σ_{n=0…W−1} x_r(n) · e^(−j2πkn/W),  k = 0, …, W − 1,
where x_r(n) represents the samples in the r-th window.
Evaluate the powers of the spectrum at each discrete frequency by squaring the magnitudes of the complex numbers obtained above as follows:
P_r(k) = |X_r(k)|².
Map the powers of the spectrum obtained above onto the Mel scale using a specific number M of triangular overlapping windows. Converting the highest frequency f_s/2 into the Mel scale,
φ_max = 2595 · log₁₀(1 + (f_s/2)/700),
the boundaries of the triangular filters equally spaced in the Mel scale were computed as follows:
φ_m = m · φ_max/(M + 1),  m = 0, …, M + 1,
and, accordingly, the boundaries of the triangular filters in the frequency scale were evaluated as follows:
f_m = 700 · (10^(φ_m/2595) − 1),
and, finally, the shape of each triangular filter in the frequency domain was computed as follows:
H_m(k) = (f(k) − f_{m−1})/(f_m − f_{m−1}) for f_{m−1} ≤ f(k) ≤ f_m; H_m(k) = (f_{m+1} − f(k))/(f_{m+1} − f_m) for f_m < f(k) ≤ f_{m+1}; H_m(k) = 0 otherwise,
with m = 1, …, M, where f(k) is the frequency associated with the k-th DFT bin.
Given the filter bank, the power at the output of each filter was computed as follows:
E_r(m) = Σ_{k=0…W−1} H_m(k) · P_r(k),  m = 1, …, M.
Take the logs of the powers at the output of each filter:
log E_r(m),  m = 1, …, M.
Take the discrete cosine transform of the Mel log powers (MFCCs):
c_r(l) = Σ_{m=1…M} log(E_r(m)) · cos[π l (m − 0.5)/M],  l = 1, …, M.
Actually, the values of log E_r(m), varying m and r, can be used to build a Mel spectrogram of the audio in a chunk.
After completing this feature extraction process, we compute a vector, c̄, of M values by averaging, coefficient by coefficient, the outputs obtained over all the R windows:
c̄(l) = (1/R) Σ_{r=1…R} c_r(l),  l = 1, …, M.
Accordingly, c̄ corresponds to the features extracted from the N samples in a chunk. In this study, W is set equal to 2048, H is set equal to 512 (i.e., 75% overlap between consecutive windows), and M is set equal to 50.
Figure 4 shows both the Mel spectrograms (built using log E_r(m)) and an image representation of the vector c̄ for the audio chunks in the previous
Figure 3. It appears evident that the image representations of the proposed features are quite similar for chunks of recordings belonging to the same subject and, on the contrary, show marked differences when comparing chunks of recordings belonging to different subjects.
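The per-chunk feature extraction just described can be approximated with off-the-shelf tools. The sketch below uses librosa as a stand-in for the windowed DFT, Mel filter bank, log, and DCT chain, with W = 2048, H = 512, and M = 50 as in the text; librosa's internal choices (window type, Mel scale variant, centering) may differ slightly from the formulas above, so this is an approximation rather than the authors' exact code, and the function name is illustrative.

```python
import numpy as np
import librosa

def chunk_mfcc_vector(chunk, fs, n_fft=2048, hop=512, n_mfcc=50):
    """Average MFCC vector for one 2 s chunk, approximating the chain described in the text."""
    # librosa computes the windowed DFT, Mel filter bank, log, and DCT internally.
    mfcc = librosa.feature.mfcc(y=np.asarray(chunk, dtype=float), sr=fs,
                                n_mfcc=n_mfcc, n_fft=n_fft,
                                hop_length=hop, n_mels=n_mfcc)
    # Average over the R windows to obtain a single 50-dimensional feature vector c̄.
    return mfcc.mean(axis=1)
```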
3.5. System Database
Let I be the number of subjects in the database and K_i the number of chunks extracted from the i-th subject's PCG recordings. Accordingly, we can consider for each subject a set of MFCCs as
C_i = { c̄_{i,1}, c̄_{i,2}, …, c̄_{i,K_i} },  i = 1, …, I.
The proposed human verification system expects that a trusted and verified PCG recording is acquired for each subject. Each recording is processed as in the previous section with the aim of obtaining a sequence of MFCCs (imprint) representing each subject. At run time, each subject provides to the system his/her identity to be verified, and a new PCG recording of the subject is acquired. From this last PCG recording, a sequence of MFCCs is extracted. These are compared with the imprint of the declared identity in order to perform the verification task. The output of this verification step can be “1” (the subject is truthful) or “0” (the subject is an impostor). We defined “1” as the “positive” result and “0” as the “negative” result. Accordingly, we have a “false negative” classification if we misclassify a truthful subject as an impostor and, on the contrary, a “false positive” classification if we misclassify an impostor as truthful. To perform the comparison between the acquired MFCCs and those stored as the imprint of the declared subject, we simply considered the vectors obtained as their difference. These difference vectors should approach the null vector for a truthful subject and should move away from it for an impostor. We have entrusted an MLP (properly trained) with the task of carrying out the classification.
Considering the MFCCs obtained from all the recordings in the HSCT-11 as the starting point, we built two different sets, named class “1” (equal subject), D_1, and class “0” (different subjects), D_0, respectively. Features belonging to class “1” are obtained using Equation (3), i.e., subtracting the vectors containing the MFCCs obtained from each chunk belonging to PCG recordings of the same subject:
D_1 = ⋃_{i=1…I} { c̄_{i,n} − c̄_{i,m} : 1 ≤ n < m ≤ K_i }.
Specifically, for each subject, we built a subset containing the union of the vectors computed as the differences of each pair of MFCCs with the index of the second term higher than the index of the first term (intra-subject MFCC differences). The overall set is built taking the union of the subsets built for each subject. Definitively, class “1” contains the features of truthful subjects who declared their true identity. Features belonging to class “0” are obtained using Equation (4), i.e., subtracting the vectors containing the MFCCs obtained from each chunk belonging to PCG recordings of different subjects:
D_0 = ⋃_{i≠j} { c̄_{i,n} − c̄_{j,m} : 1 ≤ n ≤ K_i, 1 ≤ m ≤ K_j }.
Specifically, for each possible couple of subjects, we built a subset containing the union of the vectors computed as the differences of each pair of MFCCs (inter-subject MFCC differences). The overall set is built taking the union of the subsets built for each couple. Definitively, class “0” contains the features of impostors who declared a false identity.
The cardinality (i.e., the size of a set in terms of the number of elements it contains) of class “1” (D_1) is intrinsically lower than the cardinality of class “0” (D_0). In order to normalize the number of items in each class, the dataset was filtered to obtain the same number of elements in both classes: for the HSCT-11 dataset, class “1” contains 46,008 elements, and class “0” was accordingly randomly decimated to contain 46,008 elements.
In order to train and subsequently test the neural network, the dataset is divided randomly into two distinct parts used for learning and testing, corresponding to 70% and 30%, respectively.
Table 1 shows the number of vectors containing MFCCs for each class in the dataset.
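A possible way to build the two classes of difference vectors and the 70/30 split is sketched below. Here mfcc_sets (a mapping from subject identifier to the array of per-chunk MFCC vectors), the random seed, and the decimation strategy are illustrative assumptions, not the authors' released code.

```python
import numpy as np
from itertools import combinations
from sklearn.model_selection import train_test_split

def build_difference_dataset(mfcc_sets, rng=np.random.default_rng(0)):
    """Build class '1' (intra-subject) and class '0' (inter-subject) MFCC difference vectors."""
    intra, inter = [], []
    subjects = list(mfcc_sets.keys())
    # Class "1": differences between chunks of the same subject.
    for s in subjects:
        C = mfcc_sets[s]
        for n, m in combinations(range(len(C)), 2):   # pairs with m > n
            intra.append(C[n] - C[m])
    # Class "0": differences between chunks of different subjects.
    for s, t in combinations(subjects, 2):
        for cs in mfcc_sets[s]:
            for ct in mfcc_sets[t]:
                inter.append(cs - ct)
    intra, inter = np.array(intra), np.array(inter)
    # Balance the classes by random decimation of class "0".
    keep = rng.choice(len(inter), size=len(intra), replace=False)
    inter = inter[keep]
    X = np.vstack([intra, inter])
    y = np.concatenate([np.ones(len(intra)), np.zeros(len(inter))])
    # 70% / 30% random split for learning and testing.
    return train_test_split(X, y, test_size=0.3, random_state=0)
```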
3.6. Classifier
In the spectrum of AI tools, Multilayer Perceptrons have become fundamental for classification tasks. MLPs applied to classification tasks are highly versatile and effective, able to differentiate between multiple classes, making them excellent for a variety of applications, from digit recognition to intricate object classification. At its foundation, an MLP designed for classification tasks is composed of multiple layers of neurons, each layer being fully connected to the next. The standard architecture includes the following:
The layer that receives the input features, i.e., “input layer”.
A series of intermediate layers in which neurons apply learned weights and biases to the inputs, passing the results through an activation function such as ReLU (Rectified Linear Unit), i.e., “hidden layers”. The number of hidden layers and neurons per layer can vary based on the complexity of the task.
The final layer in the MLP contains one neuron for each class, i.e., “output layer”. In binary classification (which is our case), a single neuron with a logistic activation function is used, producing a probability between 0 and 1. For multiclass classification, each class is represented by a neuron, and the Softmax activation function is applied to ensure the output probabilities add up to one.
Several hyperparameters play a crucial role in the performance of classification MLPs:
Number of Layers and Neurons: Increasing the number of layers and neurons allows the model to capture more complex patterns, but it also demands more computational resources and increases the risk of overfitting.
Activation Functions: ReLU is frequently used for hidden layers due to its simplicity and efficiency, while Softmax is typically employed in the output layer for multiclass classification.
Learning Rate: This parameter controls the step size in weight updates. A learning rate that is too high may cause the model to converge prematurely to a suboptimal solution, whereas a rate that is too low can slow down the training process.
Batch Size: This refers to the number of training examples processed in each iteration of weight updates. Smaller batch sizes may result in noisier updates but can often lead to faster convergence.
Usually, hyperparameter setting is based on heuristics, and the structure of the MLP-based classifier is obtained by consecutive refinements. As a first step, we adopt the same structure of the MLP used in [
52], except for the size of the input and output layer. Accordingly, as depicted in
Figure 5, three hidden layers have been adopted with 150, 100, and 50 neurons, respectively.
We used the ReLU activation function in the hidden layers, the Adam solver, and a constant learning rate equal to 0.001. The batch size was fixed at 200, and we used the log-loss as the loss function. The network is first appropriately trained with the “learning set” and then tested using the “testing set”. The adoption of this architecture and of these hyperparameter settings permitted us to obtain very good results. Accordingly, in this phase of our research, we decided not to further investigate architecture optimization and fine-tuning of the hyperparameters of the MLP-based binary classifier.
To perform training, the network receives as input the difference vectors of class “1” and class “0” belonging to the learning set, together with the corresponding labels as the desired output. As both classes of vectors are computed as differences of two MFCC feature vectors, the input vectors consist of 50 real numbers. The network was trained setting the maximum number of epochs to 100.
In the testing phase, the network receives as input the difference vectors of class “1” and class “0” belonging to the “test set” of the system database. Accordingly, for each input vector, it provides an output class y in the set {0, 1}. If the output class is equal to the class from which the vector was taken, we have a correct classification; otherwise, we have a misclassification. The network performs the final class attribution by means of a threshold-based step acting on the value y_out assumed by the single node in the output layer. If y_out ≥ th, class “1” is attributed; otherwise, class “0”. The value of th can be appropriately tuned to achieve specific goals, such as the equal error rate (EER) between false positives and false negatives, as we have considered in our approach.
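The classifier described above can be reproduced, for instance, with scikit-learn. The hyperparameters below mirror those reported in the text, while the choice of MLPClassifier, the variable names, and the use of predict_proba for the thresholding step are our assumptions; X_train, y_train, and X_test stand for the difference vectors and labels from the 70/30 split of the system database.

```python
from sklearn.neural_network import MLPClassifier

# Architecture and hyperparameters as reported in the text; the scikit-learn
# implementation itself is an assumption, not the authors' released code.
clf = MLPClassifier(hidden_layer_sizes=(150, 100, 50),
                    activation='relu',
                    solver='adam',
                    learning_rate='constant',
                    learning_rate_init=0.001,
                    batch_size=200,
                    max_iter=100)            # maximum number of epochs

clf.fit(X_train, y_train)                    # X_*: 50-dimensional MFCC difference vectors

# Threshold-based attribution on the output-node value y_out; th can be tuned,
# e.g., to reach the equal error rate, instead of the default 0.5 used here.
th = 0.5
y_out = clf.predict_proba(X_test)[:, 1]
y_pred = (y_out >= th).astype(int)
```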
3.7. Post-Processing
To enhance the method’s robustness and accuracy, even when environmental noise is captured by the electronic stethoscope, a filter known as the “recurrence filter” [
53] is employed in the decision block. This filter improves performance by analyzing a series of subsequent decisions made by the MLP, y_i, ultimately producing the class ŷ_i as the final output at the i-th iteration. More specifically, the filter acts on a circular vector (i.e., a queue with FIFO policy), which always contains the latest L + 1 decisions. After a transition phase lasting 2(L + 1) seconds, the filter responds every 2 seconds.
The method then provides the first decision on the verification after L + 1 iterations, when the circular buffer fills up, i.e., after a “response time” lasting 2(L + 1) seconds. We named “response time” the delay introduced by the application of the “recurrence filter” as a post-processing block. It is important to note that this delay affects only the first response of the system; i.e., it is not cumulative, and subsequent responses can be provided every 2 seconds.
Figure 6 shows the architecture of the recurrence filter.
In order to analyze the performance of the proposed approach both in terms of accuracy and response time, we analyzed the results varying
L in the range [0–10]. Accordingly, the response time analyzed is always in the range [2–22] seconds, as reported in the results shown in
Section 4.3.
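The exact formulation of the recurrence filter is given in [53]; the snippet below is a minimal sketch assuming a majority vote over a FIFO buffer of the latest L + 1 MLP decisions, which matches the response times discussed above but is not necessarily the rule used by the original filter.

```python
from collections import deque, Counter

class RecurrenceFilter:
    """FIFO buffer over the latest L + 1 MLP decisions; the majority-vote rule is an assumption."""
    def __init__(self, L):
        self.buffer = deque(maxlen=L + 1)

    def update(self, y_i):
        """Push the new MLP decision; return the filtered class once the buffer is full."""
        self.buffer.append(y_i)
        if len(self.buffer) < self.buffer.maxlen:
            return None                      # still within the transition (response-time) phase
        return Counter(self.buffer).most_common(1)[0][0]
```

With each MLP decision covering 2 s of signal, the first filtered output becomes available after 2(L + 1) seconds, and a new one is produced every 2 s thereafter.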
3.8. Analysis of Robustness to Environmental Noise
In order to analyze the system robustness during the testing phase, ambient noise of the “office noise” [
54] and “babble noise” [
55] types was digitally added to the PCG sequences in the testing database according to Equation (6):
x_noisy(n) = x(n) + w(n),
where x(n) represents the samples of the digital PCG audio signal split into two-second chunks; w(n) represents the samples of the ambient noise split into two-second chunks; and x_noisy(n) represents the samples of the digital PCG audio signal with added noise. In particular, three noise testing datasets were created:
The first testing dataset contains the addition of “office noise”. The noise is normalized to a power of −42 dBFS so that the signal-to-noise ratio (SNR) is 15 dB. The noise testing database is denominated “NT-DB-Off-SNR-15dB”.
The second dataset contains the addition of “babble noise”. The noise is normalized to a power of −47 dBFS so that the SNR is 20 dB. The obtained test dataset is denominated “NT-DB-Bab-SNR-20dB”.
The third dataset also contains the addition of “babble noise”, but, in this case, the noise is normalized to a power of −57 dBFS so as to have an SNR equal to 30 dB. The obtained testing database is denominated “NT-DB-Bab-SNR-30dB”.
Figure 7 shows an example of a chunk affected by the three ambient noises, both in terms of the signal in the time domain and the Mel spectrogram. Specifically, we report a chunk and the related Mel spectrogram obtained from the 2nd recording of a female subject. The graphs in the first row (
Figure 7a–c) show in blue the clean chunk x(n) and in red the noisy chunk x_noisy(n), where the latter includes “office noise” with SNR = 15 dB, “babble noise” with SNR = 20 dB, and “babble noise” with SNR = 30 dB, respectively. The graphs in the second row (
Figure 7d–f) show the Mel spectrograms and the features computed for the noisy chunks in the corresponding column of the first row.
In order to make the system uniform in terms of audio signal strength, power normalization to −27 dBFS (initial average power of all PCG chunks) is reapplied to all testing datasets.
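The creation of the noisy testing chunks can be sketched as follows, reusing the hypothetical normalize_power_dbfs helper from the preprocessing sketch. Treating the SNR as the difference between the −27 dBFS signal level and the noise level in dBFS follows the figures given above, but the exact mixing procedure of the original work may differ.

```python
import numpy as np

def add_noise(chunk, noise_chunk, snr_db, signal_dbfs=-27.0):
    """Mix a -27 dBFS PCG chunk with a noise chunk scaled to obtain the required SNR."""
    # Scale the noise so that signal_dbfs - noise_dbfs = snr_db.
    noise = normalize_power_dbfs(np.asarray(noise_chunk, dtype=float),
                                 target_dbfs=signal_dbfs - snr_db)
    noisy = np.asarray(chunk, dtype=float) + noise        # Equation (6)-style additive mix
    # Re-apply the -27 dBFS power normalization, as described in the text.
    return normalize_power_dbfs(noisy, target_dbfs=signal_dbfs)

# e.g., "office noise" at SNR = 15 dB, "babble noise" at SNR = 20 dB and 30 dB
```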
Once the datasets are created, they undergo the feature extraction process; consequently, the biometric algorithm for identity verification was applied as described in
Section 3.3,
Section 3.4,
Section 3.5,
Section 3.6 and
Section 3.7. The MFCC vectors obtained for each class have been fed to the system as input to evaluate its robustness to environmental noise.
3.9. Performance Metrics
To evaluate the performance of the proposed architecture, we analyzed the confusion matrices considering positive samples obtained from the same subject (class “1”) and negative samples obtained from different subjects (class “0”). Consequently, comparing the true and predicted classes, we can identify the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) outcomes.
The binary classifier requires a threshold value setting in order to maximize its performance. Performance has to be evaluated taking into account the target of the classification task.
Specifically, in a binary classifier, two parameters and their complementary counterparts can be taken into account:
True positive rate (TPR), also referred to as recall or sensitivity, is the probability of a positive test result, conditioned on the test item truly being positive: TPR = TP/(TP + FN).
True negative rate (TNR), also referred to as specificity, is the probability of a negative test result, conditioned on the test item truly being negative: TNR = TN/(TN + FP).
The false negative rate (FNR) is the proportion of actual positive cases that result in negative test outcomes. It is the complement of the TPR: FNR = FN/(FN + TP) = 1 − TPR.
The false positive rate (FPR) is the probability of obtaining a positive result when the true value is negative (namely, the probability that a false alarm is raised). It is the complement of the TNR: FPR = FP/(FP + TN) = 1 − TNR.
A receiver operating characteristic (ROC) curve is a graphical representation that shows how well a binary classifier model performs across different threshold values (and can also be applied to multiclass classification). The ROC curve plots TPR versus FPR for each threshold setting. The area under the ROC curve (AUC) indicates the probability that the model will correctly rank a randomly selected positive example higher than a randomly selected negative example. AUC is a valuable metric for comparing the performance of two different models, provided that the dataset is relatively balanced. AUC varies in the range from 0.5 (worst model) to 1 (perfect model). The model with the larger area under the curve is typically considered better in the comparison. If a specific FPR is required, one can set the threshold in order to maximize the TPR at the desired FPR; conversely, if a specific TPR is required, one can set the threshold in order to minimize the FPR at the desired TPR.
A detection error tradeoff (DET) graph is a visual representation of error rates in binary classification systems, displaying the FNR against the FPR. It can be used to set the threshold and, in particular, makes it simple to point out the condition of equal error rate (EER), i.e., the operating point at which FPR = FNR.
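As an example of how the EER operating point can be located from the classifier scores, the following sketch uses scikit-learn's roc_curve; picking the threshold where FPR and FNR are closest is a common approximation of the exact EER condition, and the function name is illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve

def eer_threshold(y_true, y_scores):
    """Find the threshold where FPR and FNR (= 1 - TPR) are closest (approximate EER point)."""
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)
    fnr = 1.0 - tpr
    idx = np.argmin(np.abs(fpr - fnr))        # index closest to the EER condition FPR = FNR
    eer = (fpr[idx] + fnr[idx]) / 2.0
    return thresholds[idx], eer
```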
In addition to the already defined performance indexes “recall” and “specificity”, we decided to add the following ones to analyze the performance in noisy conditions (a computation sketch is given after this list):
Precision: the ability to properly identify positive samples, Precision = TP/(TP + FP).
Negative predictive value (NPV): the ability to properly identify negative samples, NPV = TN/(TN + FN).
Accuracy: the fraction of all predictions correctly classified, Accuracy = (TP + TN)/(TP + TN + FP + FN).
F1-score: the harmonic mean between precision and recall, F1 = 2 · (Precision · Recall)/(Precision + Recall).
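The indexes above can be computed directly from the confusion matrix, as in this sketch (class “1” is taken as the positive class; the function and variable names are illustrative).

```python
from sklearn.metrics import confusion_matrix

def verification_metrics(y_true, y_pred):
    """Compute the performance indexes defined above from the binary confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    recall      = tp / (tp + fn)               # TPR / sensitivity
    specificity = tn / (tn + fp)               # TNR
    precision   = tp / (tp + fp)
    npv         = tn / (tn + fn)
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    f1          = 2 * precision * recall / (precision + recall)
    return dict(recall=recall, specificity=specificity, precision=precision,
                npv=npv, accuracy=accuracy, f1=f1)
```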
5. Discussion
In this section, we discuss the results obtained and compare them with state-of-the-art PCG-based human identity verification systems. It must be noted that, to perform a fair comparison, it is mandatory to consider studies using the same dataset and biometric method as it would be unreliable to compare studies using different approaches.
Performance figures evaluated on a dataset consisting of PCG recordings from a high number of subjects are statistically more significant. At this time, HSCT-11 contains PCG recordings from 206 subjects, and, in terms of the number of subjects, it exceeds all other similar datasets [
30]. To the best of our knowledge, those papers using HSCT-11 to build and evaluate a human identity verification system based on PCG recordings include [
34,
36]. The first observation in comparison with the latter studies is that our approach introduced a neural network model to perform human identity verification based on PCG biometrics. Furthermore, this study analyzed the performance of the verification approach in the presence of environmental noise (obtaining an accuracy ranging from 89.10% to 100% when varying the type of noise and observation window).
Comparing the values of the performance parameters, Ref. [
34] achieved an EER = 13.66% by means of a statistical approach based on GMM as the classification model and a feature array consisting of 16 LFCCs + ΔLFCCs + E + ΔE (i.e., static and first-derivative cepstral coefficients plus the frame energy and its first derivative).
The authors of [
36] proposed a different set of features with the aim to increase the performance of the human identity verification system proposed in [
34]. They obtained an EER = 11.16% using modified Mel-Frequency Cepstral Coefficients (M-MFCCs) and an EER = 8.73% using Wavelet Packet Decomposition (WPD) denoted as Wavelet Packet Cepstral Coefficients (WPCCs).
By means of the use of 50 MFCCs and our approach based on array differences and MLP, we obtained an EER = 5%, which is significantly lower than the values obtained in previous works. This last result was obtained by analyzing a set of signals lasting only 2 s. We proved that the error rate can be further reduced, obtaining accuracy, precision, and recall values almost equal to 100% (i.e., an EER approaching 0%), by increasing the length of the “recurrence filter” applied at the output of the MLP (and, accordingly, analyzing longer sets of signals lasting from 2 to 22 s). Finally, we demonstrated the robustness of our system to additive environmental noise, which has not been considered in similar previous works.
Although it does not use the HSCT-11 dataset and the approach used to acquire the signal is significantly different from ours (it uses sounds captured at the ear level by in-ear microphones), we propose a qualitative comparison with [
51], a recent paper that addressed human identity verification by means of PCG. In that paper, the performance was evaluated by examining the false accepted rate (FAR) and the false rejected rate (FRR). A 5-fold cross-validation experiment (20% training and 80% testing) was conducted in a static scenario involving 45 participants. Overall, in clean conditions, the authors reported average FAR and FRR values of 1.6% and 1.8%, respectively. These results appear to be better than ours, but one should take into account that the duration of the recordings was set in the range from 30 s to 5 min, significantly higher than our original 2 s. Moreover, the authors claimed a noticeable decline in their system’s overall performance in the presence of ambient noise, while our system appears to be robust to additive environment noise.
In general, comparing our approach to the HeartPrint system [
51], several notable advantages can be observed:
The simplicity of the proposed system is evident in the use of a compact feature set (only MFCCs) and the absence of signal alignment requirements, thus facilitating the creation of more economical and energy-efficient devices.
The performance is comparable (and even better) while analyzing shorter audio signals, resulting in faster authentication and reduced battery drain for devices not connected to power sources.
The robustness to ambient noise at very low SNRs enables effective use in crowded environments such as stations and airports.