1. Introduction
In modern society, the identification procedure in security systems is of critical importance and is gaining increasing attention. In the past, the most widely used security tools were based on passwords (something uniquely known only to the user) or tokens (proving that the user possesses something unique that identifies him/her). These methods, however, proved to have a number of disadvantages: such security tools can be lost, stolen, discovered, or copied. Therefore, in order to prevent fraud and cybercrime, researchers around the world have devised innovative user recognition approaches based on more secure identification systems [
1].
Biometrics represents the perfect solution as it focuses on the intrinsic characteristics of the person, requiring his/her physical presence and minimizing the likelihood of a successful intrusion. Biometrics can be based on the physiological (fingerprint, iris, retina, heart sound, and ECG) [
2,
3,
4] or behavioral aspects of the person (voice, face, gait, etc.) [
5,
6]. Often, a biometric system aims to identify or authenticate a person based on the measurement of one or more biometric traits [
7,
8].
Many human traits have been proposed and studied for identity recognition purposes, such as fingerprints, face, iris, and voice.
Fingerprinting-based systems [
9] can be considered the predecessor of automatic human identification approaches. The Automated Fingerprint Identification System (AFIS) was first introduced in 1980 and enabled quick and efficient matching of fingerprints against the available data in large databases. The integration of fingerprint sensors into consumer electronics was introduced in the early 2000s, beginning with the introduction of fingerprint-based door locks and followed by integration into notebooks and smartphones. In the 21st century, fingerprints have become a key component of the biometric systems used in various industries, from banking to border control. Governments around the world have implemented biometric passports and national ID systems, utilizing fingerprints as a core element for secure and reliable identity verification. The fingerprinting approach has been successfully applied to other recognition systems, considering the meaning of “fingerprint” in a much broader sense than just “an impression left by the friction ridges of a human finger”. An audio fingerprint, for example, is a concise and distinct representation of an audio file or stream, enabling the identification of a song in any format without relying on embedded watermarks or metadata [
10,
11,
12,
13,
14]. However, fingerprinting-based identification systems have some drawbacks: for example, the anti-cancer drug capecitabine can lead to the loss of fingerprints. Similarly, swelling of the fingers, such as that resulting from bee stings, may cause fingerprints to temporarily disappear. As skin elasticity decreases with age, many elderly individuals have fingerprints that are more difficult to capture. The ridges become thicker, and the difference in height between the top of the ridge and the bottom of the furrow diminishes, reducing their prominence. Fingerprints can also be permanently removed, a tactic that criminals might use to lower their chances of being identified. This erasure can be accomplished by various means, including burning the fingertips, applying acids, or resorting to advanced techniques like plastic surgery.
Thanks to technological advancements in the design of digital cameras and digital image processing algorithms, automatic face recognition methods have become quite common. However, the differences between individuals are often subtle: facial structures and the shapes of facial features tend to be quite similar. These characteristics alone are not sufficient for reliably distinguishing people based solely on their faces. Furthermore, the shape of the face is highly variable, influenced by factors such as expression, viewing angle, age, and lighting. In summary, while face recognition has high usability value, it suffers from low uniqueness and matching precision. Moreover, face images and videos are very easy to access. There is often no need to steal a user’s photo. Attackers can easily obtain the data they need from the internet, particularly through social networks. With those images and videos, it can be relatively simple to deceive an automatic face recognition system [
1,
15,
16].
Iris recognition can be performed by means of digital cameras too. It presents lower universality than a face recognition system because a certain number of users may have visual impairments. The strengths are uniqueness and permanence. Attackers can use high-resolution cameras to steal an iris image and attack an iris-based recognition system. Finally, contact lenses, whether colored or patterned, can bypass an iris recognition system in several ways: altering the iris details, creating a false pattern, creating reflections or distortions, or generating a partial occlusion of the iris. Overall, the use of lenses can compromise the accuracy of iris recognition systems, making them vulnerable to bypassing or false acceptance [
1,
17,
18].
The human voice possesses sufficient variability between individuals and demonstrates a high level of consistency within each person. This means that the distinct characteristics, such as pitch, tone, accent, and speech patterns, vary enough across different users to enable reliable identification. At the same time, these vocal traits remain stable for a given individual over time, enabling voice recognition systems to accurately authenticate users based on their unique vocal signatures. A simple microphone is required for voice data collection. However, since sound travels in all directions in an open environment, an attacker can record a user’s voice and replay it during the user authentication step. Accordingly, a voice-based authentication system can be deceived very simply [
1,
19,
20,
21].
More recently, a new set of biometric traits, called medical biometrics, have gained momentum. These types of systems are based on the individual unique features of the electrocardiogram (ECG) [
22,
23,
24,
25,
26,
27,
28] and phonocardiogram (PCG) [
29,
30,
31].
The above-mentioned signals represent a measure of the mechanical (PCG) and electrical (ECG) activity of the heart. The physiological nature of these two signals makes the biometric systems relying on them less vulnerable and more robust. Indeed, heart-based signals are known to be rather distinguishable across subjects (unique), hard to hide (measurable), and difficult to counterfeit (secure). To counterfeit heart biometric signals, an individual would essentially need to undergo significant surgical procedures to alter the structure of his/her body or, even more invasive, his/her heart (transplant). Additionally, the acquisition of such signals by third parties in a covert manner is nearly impossible without the active cooperation of the individual.
Considering the use of the PCG reading as a biometric signal, the authors in [
32,
33] paved the way for future research introducing the use of the sound produced by the heartbeat as a biometric system through the extraction of S1–S2 tones and the employment of various feature extraction techniques. Consequently, numerous studies were conducted by fellow researchers proposing innovative methods and algorithms for automatic biometric identification using PCG signals through feature extraction in the frequency and time domains, as well as machine learning and deep learning techniques [
34,
35,
36,
37,
38,
39].
The aim of the present study is to extend the initial research [
32,
33] focusing on the verification task by employing machine learning via a Multilayer Perceptron (MLP) network and further processing the cardiac sound signal. In particular, the study has set the following objectives:
Define a new set of features to perform subject verification based on the Mel-Frequency Cepstral Coefficient (MFCC) differences obtained in intra-subject and inter-subject sets.
Train an MLP neural network using as the input the MFCC-based features to classify the authenticity of a declared subject.
Evaluate the performance of the proposed model to perform the verification task of a subject.
Implement a mechanism based on a “recurrence filter” to improve the performance of the system, analyzing the outputs of the model regarding a broader time window of the sound signal.
Evaluate the system performance in the presence of typical environmental noise (e.g., office and babble noise).
2. Related Work
In the field of biometrics based on heart sounds, some authors of this study have built upon the previous research, including the pioneering work conducted by Beritelli and Serrano [
32]. In [
32], the authors achieved a 5% rejection rate and a 2.2% false acceptance rate using the chirp z-transform (CZT) for feature extraction and Euclidean distance (ED) for classification. Subsequently, the methodology [
40] was improved by incorporating sub-band aggregation, achieving an equal error rate (EER) of 10% with 70 subjects. In [
33], they refined the system further by employing Mel-Frequency Cepstral Coefficients (MFCCs), reducing the EER to 9% with 50 participants. Further testing with 40 participants lowered the EER to 5%. In [
41], they continued to develop methods using GMM and spectral features, improving the performance with a 13.7% EER on a dataset of 165 people. Instead, in [
42], the authors compared statistical and non-statistical approaches, with GMM-based statistical methods yielding a 15.53% EER, while non-statistical methods had a higher 29.08% EER, tested on 147 subjects. In [
34], the authors applied an identification system using LFCC and FSR for feature extraction and GMM for classification, achieving an EER of 13.66% with a database of 206 individuals.
Abo El Zahad and colleagues [
43,
44,
45] made significant contributions to the field of biometric systems using various feature extraction techniques. Initially, they developed a human verification system utilizing Wavelet Packet Cepstral Coefficients (WPCCs) for feature extraction, alongside Linear Discriminant Analysis (LDA) and the Bayes rule for classification, achieving an identification accuracy of 91.05% and an equal error rate (EER) of 3.2%. Later, they introduced a more robust human identification system employing several feature extraction methods, including Mel-Frequency Cepstral Coefficients (MFCCs), Linear Frequency Cepstral Coefficients (LFCCs), WPCCs, and Non-Linear Frequency Cepstral Coefficients (NLFCCs), obtaining EERs of 2.88% and 2.13% regarding different subject groups. In 2016, they further refined their methodology for individual identification, again utilizing MFCC, LFCC, Modified MFCC (M-MFCC), and WPCC approaches, achieving an identification accuracy of 91.05% and EERs of 3.2% and 2.68% regarding different groups. In [
46], the authors developed a PCG recognition system using MFCCs and k-means for feature extraction, tested on sixteen subjects with 626 heart sounds for training and six subjects with 120 heart sounds for testing. DNN achieved the highest accuracy of 91.12%. In [
29], the authors presented a PCG biometric system using wavelet preprocessing and various matching methods (ED, GMM, FSR, and VQ). The work proposed in [
47] describes a PCG identification method using autoregressive modeling, wavelets for de-noising, Hilbert envelope for segmentation, and bagged decision trees for classification, achieving 86.7% accuracy with 50 subjects. The authors in [
47] worked on two datasets with 60 and 50 subjects using MRD-MRR for preprocessing, SEE and multi-scale wavelet transforms for feature extraction, and various classifiers (RF, SVM, ANN, and KNN). RF had the highest accuracy in time–frequency analysis, and SVM in time-scale analysis.
In [
39], the authors analyzed 80 heart sounds from 40 subjects, using IMFs for feature extraction, a logistic regression–HSMM model for heart sound segmentation, and the Fisher ratio for feature selection, achieving 96.08% accuracy. In [
48], Tagashira and Nakagawa proposed a biometric system that works with identical features used for the detection of abnormal heart sounds and calculated from temporal sound power. Specifically, the features were obtained by summing the power spectral components calculated from the time–frequency analysis of heart sounds. They used the Mahalanobis distance (MD) obtained by the Mahalanobis–Taguchi (MT) method to perform identification. The performance of the proposed system produced a 90–100% authentication rate for 10 research participants. In [
49], the authors designed a non-invasive, discreet, and accurate device named the Continuous Cardiac Biometric (CCB) Patch. The device integrates a microphone chip on a flexible printed circuit board (PCB). The battery is encapsulated with the entire board inside a bio-compatible silicone case. The captured heart sounds are transmitted to a mobile device connected via Bluetooth. Following a two-stage filtering process with bandpass and zero-phase filters, the signals are preprocessed and labeled to mark each S1 and S2 heart sound peak. These labeled signals are then input into a machine learning program using a Convolutional Neural Network to create a profile of each individual’s heart sound. The performance, measured as accuracy on a test with 10 participants, was 98.3%. A few years later [
50], they extended their work, obtaining an accuracy of 99.5% on a dataset consisting of 20 participants.
Recently, the authors of [
51] introduced HeartPrint, a passive authentication system that takes advantage of the unique binaural bone-conducted PCGs recorded using dual in-ear monitors (IEMs)—sensors commonly found in widely used earables like Active Noise Cancellation (ANC) earphones and hearing aids. In particular, this method exploits both the human heart position, slightly on the left side of the body, causing different propagation paths for PCG to travel towards both ears, and distinct body asymmetry patterns shown by different individuals. Overall, these distinct characteristics result in the generation of unique binaural bone-conducted PCGs for each individual. Three different features are extracted, related to heart motion (2 × 16 MFCCs), body conduction (2 × 16 LPCs), and body asymmetry (energy at the output of 2 × 16 “5 Hz bandwidth” filters between 0 and 160 Hz). The detection of the S1 and S2 positions is mandatory to extract these features. First of all, a signal window is identified 200 ms before and 500 ms after the detected S1; three sets are obtained after the segmentation of the signal window into 32 overlapped frames lasting 64 ms. An RGB image is obtained from each window by scaling each feature value set to integers between 0 and 255, where the three matrices correspond to red, green, and blue channels. These RGB images are then fed to a CNN-based user classifier, which consists of three 2D-convolutional layers, one flatten layer, one fully connected layer, and an output layer, using the Softmax function to distinguish between a legitimate user and a potential attacker.
3. Materials and Methods
3.1. Background on Phonocardiogram Biometrics
Biometrics can be based on physiological or behavioral parameters. Biometric systems based on behavioral attributes include, but are not limited to, systems that use voice, gait, and/or signature as the biometric key. Conversely, biometric systems based on physiological attributes use iris, face, and/or heart signals (ECG/PCG) as the biometric key.
The PCG signal is the result of the sounds produced by the heart, and phonocardiography is the process of recording the sound produced during the cardiac cycle. The PCG, therefore, falls under the physiological attributes, as it is unique to each individual.
A digital stethoscope must be used to record the sound and vibrations produced by the heartbeat. The heart sound is generated by the opening and closing of the heart valves and by the turbulence of the blood flow, and it carries various physiological information. All in all, the heart sound can be considered a complex, non-stationary, quasi-periodic signal with two basic tones: the first heart sound, S1, and the second heart sound, S2, which mark the beginning of the systolic and diastolic phases, respectively; together they form the cardiac cycle.
Several studies [
34,
36,
51] state that PCG signals are unique to each person and fulfil certain biometric requirements:
Universal: each and every person has a heart that produces PCG signals while beating;
Measurable: it is possible to record a PCG sequence using an electronic stethoscope;
Uniqueness: heart sound is unique for each person;
Vulnerability: the PCG signal is really difficult to falsify;
Usability: thanks to technological advances, wearable, lightweight, and minimally invasive devices have been created that facilitate the recording of PCG signals.
A biometric authentication system consists of several blocks (
Figure 1). Specifically, a biometric authentication system based on PCG acquires the human print (heart sounds) by means of a digital stethoscope positioned over the chest in proximity to the heart. The raw acquired signal is usually preprocessed to perform operations such as amplitude normalization and framing, which subdivides the sequences of audio samples into overlapped windows of specific size. Features (usually based on analysis in the frequency domain) are then extracted from each window. In the training phase, these features are collected and labeled according to the subject identity in a system database, with the aim of building the references for the successive authentication phases. Specific classifiers based on artificial intelligence or statistical models can be properly trained using these features as input (they usually tune their parameters according to the input features and the related identity labels). Other kinds of classifiers use the stored features only in the authentication phase to compute, for example, specific distance metrics. In any case, in the authentication phase, the classifier performs matching between the features currently acquired from the subject using the system and the features stored in the system database (possibly by means of the corresponding trained models). A post-processing step can be further inserted in order to obtain the decision by taking into account a greater quantity of signal than that used to obtain a single decision by the classifier. This kind of post-processing is usually implemented by means of filters acting on the output of the classifier (e.g., mean or median filters).
Biometric authentication can be performed in two modes: identification (1:N) [
39], where N is the number of subjects the system is able to identify, and verification (1:1) [
29] of a predefined number I of subjects.
Identification mode (
Figure 1a): the system acquires information on the unique features of a particular user; i.e., it acquires biometric information and searches the entire database for a match of the acquired information. After classification, the biometric system decides which user the input sample corresponds to.
Verification Mode (
Figure 1b): the classifier decides whether or not the acquired trait information on a particular individual belongs to the declared user. The system compares the acquired data with previously stored information on the same individual and authenticates the particular individual.
Regardless of the mode, the system must acquire at least one trusted imprint for each of the subjects to be identified or verified.
This study focuses on a PCG biometric authentication system implementing solely a verification mode.
3.2. PCG Dataset
As a preliminary step, it is fundamental to select an appropriate dataset of PCG recordings with which to train and test the proposed biometric verification system. The selected database is the HSCT-11 [
34], which has been thoroughly described and used in the state of the art.
The database contains heart sounds acquired from 206 people (157 male and 49 female). To the best of our knowledge, HSCT-11 is the dataset containing PCG recordings from the highest number of subjects [
30]; accordingly, it best suits our goal of human identity verification. It contains two PCG recordings for each subject, usually collected on the same day. The average length of each recording is 45 s, the minimum being 20 s and the maximum being 70 s.
Figure 2 shows examples of PCG recordings in the dataset for two different subjects (one female and one male). Regarding the images in the first row,
Figure 2a,b are related to the 1st and 2nd recordings of the female subject, respectively. Regarding the images in the second row,
Figure 2c,d are related to the 1st and 2nd recordings of the male subject, respectively.
With the same structure,
Figure 3 shows details of the PCG recordings in
Figure 2, extracted at the 20th second and lasting 2 s.
It is important to note that a chunk lasting 2 s always contains a double repetition of the S1 and S2 tones.
3.3. Audio Signal Preprocessing
The original PCG sequences have been segmented into non-overlapped sub-sequences, called “chunks”, of shorter, constant duration.
Specifically, an input window of 2 s is used, whereby all sequences are segmented into chunks of this duration. This subdivision leads to multiple PCG sub-sequences for each subject, for a total of 3967 chunks two seconds in duration.
The obtained two-second chunks are normalized in power: the power of all PCG chunks is calculated and an average is extrapolated, i.e., −27 dBFS. All audio sequences are normalized to this average power. In this way, all sequences with power higher than the predetermined average are attenuated; conversely, all sequences with lower power are amplified. This avoids variability in terms of audio signal power when comparing different recordings. Furthermore, the PCG chunks are sampled at a sampling rate of f_s Hz, resulting in a digital signal containing N = 2·f_s samples per chunk lasting 2 s.
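As an illustration of this preprocessing step, the following Python sketch shows one possible implementation of the chunking and power normalization described above. The function names, the assumption of full-scale amplitudes in [−1, 1], and the handling of the trailing partial chunk are ours and are not taken from the original implementation.

```python
import numpy as np

def split_into_chunks(x, fs, chunk_s=2.0):
    """Split a PCG recording into non-overlapping chunks of fixed duration."""
    n = int(chunk_s * fs)
    n_chunks = len(x) // n                      # the trailing partial chunk is dropped (assumption)
    return [x[i * n:(i + 1) * n] for i in range(n_chunks)]

def normalize_power_dbfs(chunk, target_dbfs=-27.0):
    """Scale a chunk (assumed in [-1, 1] full scale) to the target average power in dBFS."""
    power = np.mean(chunk ** 2)
    power_dbfs = 10.0 * np.log10(power + 1e-12)
    gain = 10.0 ** ((target_dbfs - power_dbfs) / 20.0)   # amplitude gain corresponding to the dB correction
    return chunk * gain
```

Normalizing every chunk to the −27 dBFS average reported above makes louder recordings attenuated and quieter ones amplified, as described in the text.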
3.4. Feature Extraction
After preprocessing, an MFCC parameter extraction step is performed on each chunk. The samples in each chunk are further subdivided into overlapping windows. Let W be the length in samples of each window and H be the length in samples of the hop. Given the global number N of samples in each chunk, we obtain a number R of windows, the last one padded with zeroes. The steps to compute the MFCCs on each window can be summarized as follows:
Take the discrete Fourier transform of the samples in the window as follows:
X_r(k) = Σ_{n=0…W−1} x_r(n) · e^(−j2πkn/W),  k = 0, …, W − 1,
where x_r(n) represents the samples in the r-th window.
Evaluate the powers of the spectrum at each discrete frequency by squaring the magnitudes of the complex numbers obtained above as follows:
P_r(k) = |X_r(k)|².
Map the powers of the spectrum obtained above onto the Mel scale using a specific number M of triangular overlapping windows. Converting the highest frequency f_s/2 into the Mel scale,
φ_max = 2595 · log₁₀(1 + (f_s/2)/700),
the boundaries of the triangular filters equally spaced in the Mel scale were computed as follows:
φ_m = m · φ_max/(M + 1),  m = 0, …, M + 1,
and, accordingly, the boundaries of the triangular filters in the frequency scale were evaluated as follows:
f_m = 700 · (10^(φ_m/2595) − 1),
and, finally, the shape of each triangular filter in the frequency domain was computed as follows:
H_m(k) = (f(k) − f_{m−1})/(f_m − f_{m−1}) for f_{m−1} ≤ f(k) ≤ f_m; H_m(k) = (f_{m+1} − f(k))/(f_{m+1} − f_m) for f_m < f(k) ≤ f_{m+1}; H_m(k) = 0 otherwise,
with m = 1, …, M, where f(k) is the frequency associated with the k-th DFT bin.
Given the filter bank, the power at the output of each filter was computed as follows:
E_r(m) = Σ_{k=0…W−1} H_m(k) · P_r(k),  m = 1, …, M.
Take the logs of the powers at the output of each filter:
log E_r(m),  m = 1, …, M.
Take the discrete cosine transform of the Mel log powers (MFCCs):
c_r(l) = Σ_{m=1…M} log(E_r(m)) · cos[π l (m − 0.5)/M],  l = 1, …, M.
Actually, the values of log E_r(m), varying m and r, can be used to build a Mel spectrogram of the audio in a chunk.
After completing this feature extraction process, we compute a vector, c̄, of M values by averaging, coefficient by coefficient, the outputs obtained over all the R windows:
c̄(l) = (1/R) Σ_{r=1…R} c_r(l),  l = 1, …, M.
Accordingly, c̄ corresponds to the features extracted from the N samples in a chunk. In this study, W is set equal to 2048, H is set equal to 512 (i.e., 75% overlap between consecutive windows), and M is set equal to 50.
Figure 4 shows both the Mel spectrograms (built using log E_r(m)) and an image representation of the vector c̄ for the audio chunks in the previous
Figure 3. It appears evident that the image representations of the proposed features are quite similar for chunks of recordings belonging to the same subject and, on the contrary, show marked differences when comparing chunks of recordings belonging to different subjects.
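The per-chunk feature extraction just described can be approximated with off-the-shelf tools. The sketch below uses librosa as a stand-in for the windowed DFT, Mel filter bank, log, and DCT chain, with W = 2048, H = 512, and M = 50 as in the text; librosa's internal choices (window type, Mel scale variant, centering) may differ slightly from the formulas above, so this is an approximation rather than the authors' exact code, and the function name is illustrative.

```python
import numpy as np
import librosa

def chunk_mfcc_vector(chunk, fs, n_fft=2048, hop=512, n_mfcc=50):
    """Average MFCC vector for one 2 s chunk, approximating the chain described in the text."""
    # librosa computes the windowed DFT, Mel filter bank, log, and DCT internally.
    mfcc = librosa.feature.mfcc(y=np.asarray(chunk, dtype=float), sr=fs,
                                n_mfcc=n_mfcc, n_fft=n_fft,
                                hop_length=hop, n_mels=n_mfcc)
    # Average over the R windows to obtain a single 50-dimensional feature vector c̄.
    return mfcc.mean(axis=1)
```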
3.5. System Database
Let I be the number of subjects in the database and K_i the number of chunks extracted from the i-th subject's PCG recordings. Accordingly, we can consider for each subject a set of MFCCs as
C_i = { c̄_{i,1}, c̄_{i,2}, …, c̄_{i,K_i} },  i = 1, …, I.
The proposed human verification system expects that a trusted and verified PCG recording is acquired for each subject. Each recording is processed as in the previous section with the aim of obtaining a sequence of MFCCs (imprint) representing each subject. At run time, each subject provides to the system his/her identity to be verified, and a new PCG recording of the subject is acquired. From this last PCG recording, a sequence of MFCCs is extracted. These are compared with the imprint of the declared identity in order to perform the verification task. The output of this verification step can be “1” (the subject is truthful) or “0” (the subject is an impostor). We defined “1” as the “positive” result and “0” as the “negative” result. Accordingly, we have a “false negative” classification if we misclassify a truthful subject as an impostor and, on the contrary, a “false positive” classification if we misclassify an impostor as truthful. To perform the comparison between the acquired MFCCs and those stored as the imprint of the declared subject, we simply considered the vectors obtained as their difference. These difference vectors should approach the null vector for a truthful subject and should move away from it for an impostor. We have entrusted an MLP (properly trained) with the task of carrying out the classification.
Considering the MFCCs obtained from all the recordings in the HSCT-11 as the starting point, we built two different sets, named class “1” (equal subject), D_1, and class “0” (different subjects), D_0, respectively. Features belonging to class “1” are obtained using Equation (3), i.e., subtracting the vectors containing the MFCCs obtained from each chunk belonging to PCG recordings of the same subject:
D_1 = ⋃_{i=1…I} { c̄_{i,n} − c̄_{i,m} : 1 ≤ n < m ≤ K_i }.
Specifically, for each subject, we built a subset containing the union of the vectors computed as the differences of each pair of MFCCs with the index of the second term higher than the index of the first term (intra-subject MFCC differences). The overall set is built taking the union of the subsets built for each subject. Definitively, class “1” contains the features of truthful subjects who declared their true identity. Features belonging to class “0” are obtained using Equation (4), i.e., subtracting the vectors containing the MFCCs obtained from each chunk belonging to PCG recordings of different subjects:
D_0 = ⋃_{i≠j} { c̄_{i,n} − c̄_{j,m} : 1 ≤ n ≤ K_i, 1 ≤ m ≤ K_j }.
Specifically, for each possible couple of subjects, we built a subset containing the union of the vectors computed as the differences of each pair of MFCCs (inter-subject MFCC differences). The overall set is built taking the union of the subsets built for each couple. Definitively, class “0” contains the features of impostors who declared a false identity.
The cardinality (i.e., the size of a set in terms of the number of elements it contains) of class “1” (D_1) is intrinsically lower than the cardinality of class “0” (D_0). In order to normalize the number of items in each class, the dataset was filtered to obtain the same number of elements in both classes: for the HSCT-11 dataset, class “1” contains 46,008 elements, and class “0” was accordingly randomly decimated to contain 46,008 elements.
In order to train and subsequently test the neural network, the dataset is divided randomly into two distinct parts used for learning and testing, corresponding to 70% and 30%, respectively.
Table 1 shows the number of vectors containing MFCCs for each class in the dataset.
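A possible way to build the two classes of difference vectors and the 70/30 split is sketched below. Here mfcc_sets (a mapping from subject identifier to the array of per-chunk MFCC vectors), the random seed, and the decimation strategy are illustrative assumptions, not the authors' released code.

```python
import numpy as np
from itertools import combinations
from sklearn.model_selection import train_test_split

def build_difference_dataset(mfcc_sets, rng=np.random.default_rng(0)):
    """Build class '1' (intra-subject) and class '0' (inter-subject) MFCC difference vectors."""
    intra, inter = [], []
    subjects = list(mfcc_sets.keys())
    # Class "1": differences between chunks of the same subject.
    for s in subjects:
        C = mfcc_sets[s]
        for n, m in combinations(range(len(C)), 2):   # pairs with m > n
            intra.append(C[n] - C[m])
    # Class "0": differences between chunks of different subjects.
    for s, t in combinations(subjects, 2):
        for cs in mfcc_sets[s]:
            for ct in mfcc_sets[t]:
                inter.append(cs - ct)
    intra, inter = np.array(intra), np.array(inter)
    # Balance the classes by random decimation of class "0".
    keep = rng.choice(len(inter), size=len(intra), replace=False)
    inter = inter[keep]
    X = np.vstack([intra, inter])
    y = np.concatenate([np.ones(len(intra)), np.zeros(len(inter))])
    # 70% / 30% random split for learning and testing.
    return train_test_split(X, y, test_size=0.3, random_state=0)
```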
3.6. Classifier
In the spectrum of AI tools, Multilayer Perceptrons have become fundamental for classification tasks. MLPs applied to classification tasks are highly versatile and effective, able to differentiate between multiple classes, making them excellent for a variety of applications, from digit recognition to intricate object classification. At its foundation, an MLP designed for classification tasks is composed of multiple layers of neurons, each layer being fully connected to the next. The standard architecture includes the following:
The layer that receives the input features, i.e., “input layer”.
A series of intermediate layers in which neurons apply learned weights and biases to the inputs, passing the results through an activation function such as ReLU (Rectified Linear Unit), i.e., “hidden layers”. The number of hidden layers and neurons per layer can vary based on the complexity of the task.
The final layer in the MLP contains one neuron for each class, i.e., “output layer”. In binary classification (which is our case), a single neuron with a logistic activation function is used, producing a probability between 0 and 1. For multiclass classification, each class is represented by a neuron, and the Softmax activation function is applied to ensure the output probabilities add up to one.
Several hyperparameters play a crucial role in the performance of classification MLPs:
Number of Layers and Neurons: Increasing the number of layers and neurons allows the model to capture more complex patterns, but it also demands more computational resources and increases the risk of overfitting.
Activation Functions: ReLU is frequently used for hidden layers due to its simplicity and efficiency, while Softmax is typically employed in the output layer for multiclass classification.
Learning Rate: This parameter controls the step size in weight updates. A learning rate that is too high may cause the model to converge prematurely to a suboptimal solution, whereas a rate that is too low can slow down the training process.
Batch Size: This refers to the number of training examples processed in each iteration of weight updates. Smaller batch sizes may result in noisier updates but can often lead to faster convergence.
Usually, hyperparameter setting is based on heuristics, and the structure of the MLP-based classifier is obtained by consecutive refinements. As a first step, we adopt the same structure of the MLP used in [
52], except for the size of the input and output layer. Accordingly, as depicted in
Figure 5, three hidden layers have been adopted with 150, 100, and 50 neurons, respectively.
We used the ReLU activation function in the hidden layers, the Adam solver, and a constant learning rate equal to 0.001. The batch size was fixed at 200, and we used the log-loss as the loss function. The network is first appropriately trained with the “learning set” and then tested using the “testing set”. The adoption of this architecture and of these hyperparameter settings permitted us to obtain very good results. Accordingly, in this phase of our research, we decided not to further investigate architecture optimization and fine-tuning of the hyperparameters of the MLP-based binary classifier.
To perform training, the network receives as input the difference vectors of class “1” and class “0” belonging to the learning set, together with the corresponding labels as the desired output. As both classes of vectors are computed as differences of two MFCC feature vectors, the input vectors consist of 50 real numbers. The network was trained setting the maximum number of epochs to 100.
In the testing phase, the network receives as input the difference vectors of class “1” and class “0” belonging to the “test set” of the system database. Accordingly, for each input vector, it provides an output class y in the set {0, 1}. If the output class is equal to the class from which the vector was taken, we have a correct classification; otherwise, we have a misclassification. The network performs the final class attribution by means of a threshold-based step acting on the value y_out assumed by the single node in the output layer. If y_out ≥ th, class “1” is attributed; otherwise, class “0”. The value of th can be appropriately tuned to achieve specific goals, such as the equal error rate (EER) between false positives and false negatives, as we have considered in our approach.
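The classifier described above can be reproduced, for instance, with scikit-learn. The hyperparameters below mirror those reported in the text, while the choice of MLPClassifier, the variable names, and the use of predict_proba for the thresholding step are our assumptions; X_train, y_train, and X_test stand for the difference vectors and labels from the 70/30 split of the system database.

```python
from sklearn.neural_network import MLPClassifier

# Architecture and hyperparameters as reported in the text; the scikit-learn
# implementation itself is an assumption, not the authors' released code.
clf = MLPClassifier(hidden_layer_sizes=(150, 100, 50),
                    activation='relu',
                    solver='adam',
                    learning_rate='constant',
                    learning_rate_init=0.001,
                    batch_size=200,
                    max_iter=100)            # maximum number of epochs

clf.fit(X_train, y_train)                    # X_*: 50-dimensional MFCC difference vectors

# Threshold-based attribution on the output-node value y_out; th can be tuned,
# e.g., to reach the equal error rate, instead of the default 0.5 used here.
th = 0.5
y_out = clf.predict_proba(X_test)[:, 1]
y_pred = (y_out >= th).astype(int)
```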
3.7. Post-Processing
To enhance the method’s robustness and accuracy, even when environmental noise is captured by the electronic stethoscope, a filter known as the “recurrence filter” [
53] is employed in the decision block. This filter improves performance by analyzing a series of subsequent decisions made by the MLP, y_i, ultimately producing the class ŷ_i as the final output at the i-th iteration. More specifically, the filter acts on a circular vector (i.e., a queue with FIFO policy), which always contains the latest L + 1 decisions. After a transition phase lasting 2(L + 1) seconds, the filter responds every 2 seconds.
The method then provides the first decision on the verification after L + 1 iterations, when the circular buffer fills up, i.e., after a “response time” lasting 2(L + 1) seconds. We named “response time” the delay introduced by the application of the “recurrence filter” as a post-processing block. It is important to note that this delay affects only the first response of the system; i.e., it is not cumulative, and subsequent responses can be provided every 2 seconds.
Figure 6 shows the architecture of the recurrence filter.
In order to analyze the performance of the proposed approach both in terms of accuracy and response time, we analyzed the results varying
L in the range [0–10]. Accordingly, the response time analyzed is always in the range [2–22] seconds, as reported in the results shown in
Section 4.3.
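The exact formulation of the recurrence filter is given in [53]; the snippet below is a minimal sketch assuming a majority vote over a FIFO buffer of the latest L + 1 MLP decisions, which matches the response times discussed above but is not necessarily the rule used by the original filter.

```python
from collections import deque, Counter

class RecurrenceFilter:
    """FIFO buffer over the latest L + 1 MLP decisions; the majority-vote rule is an assumption."""
    def __init__(self, L):
        self.buffer = deque(maxlen=L + 1)

    def update(self, y_i):
        """Push the new MLP decision; return the filtered class once the buffer is full."""
        self.buffer.append(y_i)
        if len(self.buffer) < self.buffer.maxlen:
            return None                      # still within the transition (response-time) phase
        return Counter(self.buffer).most_common(1)[0][0]
```

With each MLP decision covering 2 s of signal, the first filtered output becomes available after 2(L + 1) seconds, and a new one is produced every 2 s thereafter.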
3.8. Analysis of Robustness to Environmental Noise
In order to analyze the system robustness during the testing phase, ambient noise of the “office noise” [
54] and “babble noise” [
55] types was digitally added to the PCG sequences in the testing database according to Equation (6):
x_noisy(n) = x(n) + w(n),
where x(n) represents the samples of the digital PCG audio signal split into two-second chunks; w(n) represents the samples of the ambient noise split into two-second chunks; and x_noisy(n) represents the samples of the digital PCG audio signal with added noise. In particular, three noise testing datasets were created:
The first testing dataset contains the addition of “office noise”. The noise is normalized to a power of −42 dBFS so that the signal-to-noise ratio (SNR) is 15 dB. The noise testing database is denominated “NT-DB-Off-SNR-15dB”.
The second dataset contains the addition of “babble noise”. The noise is normalized to a power of −47 dBFS so that the SNR is 20 dB. The obtained test dataset is denominated “NT-DB-Bab-SNR-20dB”.
The third dataset also contains the addition of “babble noise”, but, in this case, the noise is normalized to a power of −57 dBFS so as to have an SNR equal to 30 dB. The obtained testing database is denominated “NT-DB-Bab-SNR-30dB”.
Figure 7 shows an example of a chunk affected by the three ambient noises, both in terms of the signal in the time domain and the Mel spectrogram. Specifically, we report a chunk and the related Mel spectrogram obtained from the 2nd recording of a female subject. The graphs in the first row (
Figure 7a–c) show in blue the clean chunk x(n) and in red the noisy chunk x_noisy(n), where the latter includes “office noise” with SNR = 15 dB, “babble noise” with SNR = 20 dB, and “babble noise” with SNR = 30 dB, respectively. The graphs in the second row (
Figure 7d–f) show the Mel spectrograms and the features computed for the noisy chunks in the corresponding column of the first row.
In order to make the system uniform in terms of audio signal strength, power normalization to −27 dBFS (initial average power of all PCG chunks) is reapplied to all testing datasets.
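The creation of the noisy testing chunks can be sketched as follows, reusing the hypothetical normalize_power_dbfs helper from the preprocessing sketch. Treating the SNR as the difference between the −27 dBFS signal level and the noise level in dBFS follows the figures given above, but the exact mixing procedure of the original work may differ.

```python
import numpy as np

def add_noise(chunk, noise_chunk, snr_db, signal_dbfs=-27.0):
    """Mix a -27 dBFS PCG chunk with a noise chunk scaled to obtain the required SNR."""
    # Scale the noise so that signal_dbfs - noise_dbfs = snr_db.
    noise = normalize_power_dbfs(np.asarray(noise_chunk, dtype=float),
                                 target_dbfs=signal_dbfs - snr_db)
    noisy = np.asarray(chunk, dtype=float) + noise        # Equation (6)-style additive mix
    # Re-apply the -27 dBFS power normalization, as described in the text.
    return normalize_power_dbfs(noisy, target_dbfs=signal_dbfs)

# e.g., "office noise" at SNR = 15 dB, "babble noise" at SNR = 20 dB and 30 dB
```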
Once the datasets are created, they undergo the feature extraction process; consequently, the biometric algorithm for identity verification was applied as described in
Section 3.3,
Section 3.4,
Section 3.5,
Section 3.6 and
Section 3.7. The MFCC vectors obtained for each class have been fed to the system as input to evaluate its robustness to environmental noise.
3.9. Performance Metrics
To evaluate the performance of the proposed architecture, we analyzed the confusion matrices considering positive samples obtained from the same subject (class “1”) and negative samples obtained from different subjects (class “0”). Consequently, comparing the true and predicted classes, we can identify the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) outcomes.
The binary classifier requires a threshold value setting in order to maximize its performance. Performance has to be evaluated taking into account the target of the classification task.
Specifically, in a binary classifier, two parameters and their complementary counterparts can be taken into account:
True positive rate (TPR), also referred to as recall or sensitivity, is the probability of a positive test result, conditioned on the test item truly being positive: TPR = TP/(TP + FN).
True negative rate (TNR), also referred to as specificity, is the probability of a negative test result, conditioned on the test item truly being negative: TNR = TN/(TN + FP).
The false negative rate (FNR) is the proportion of actual positive cases that result in negative test outcomes. It is the complement of the TPR: FNR = FN/(FN + TP) = 1 − TPR.
The false positive rate (FPR) is the probability of obtaining a positive result when the true value is negative (namely, the probability that a false alarm is raised). It is the complement of the TNR: FPR = FP/(FP + TN) = 1 − TNR.
A receiver operating characteristic (ROC) curve is a graphical representation that shows how well a binary classifier model performs across different threshold values (and can also be applied to multiclass classification). The ROC curve plots TPR versus FPR for each threshold setting. The area under the ROC curve (AUC) indicates the probability that the model will correctly rank a randomly selected positive example higher than a randomly selected negative example. AUC is a valuable metric for comparing the performance of two different models, provided that the dataset is relatively balanced. AUC varies in the range from 0.5 (worst model) to 1 (perfect model). The model with the larger area under the curve is typically considered better in the comparison. If a specific FPR is required, one can set the threshold in order to maximize the TPR at the desired FPR; conversely, if a specific TPR is required, one can set the threshold in order to minimize the FPR at the desired TPR.
A detection error tradeoff (DET) graph is a visual representation of error rates in binary classification systems, displaying the FNR against the FPR. It can be used to set the threshold and, in particular, makes it simple to point out the condition of equal error rate (EER), i.e., the operating point at which FPR = FNR.
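As an example of how the EER operating point can be located from the classifier scores, the following sketch uses scikit-learn's roc_curve; picking the threshold where FPR and FNR are closest is a common approximation of the exact EER condition, and the function name is illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve

def eer_threshold(y_true, y_scores):
    """Find the threshold where FPR and FNR (= 1 - TPR) are closest (approximate EER point)."""
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)
    fnr = 1.0 - tpr
    idx = np.argmin(np.abs(fpr - fnr))        # index closest to the EER condition FPR = FNR
    eer = (fpr[idx] + fnr[idx]) / 2.0
    return thresholds[idx], eer
```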
In addition to the already defined performance indexes “recall” and “specificity”, we decided to add the following ones to analyze the performance in noisy conditions (a computation sketch is given after this list):
Precision: the ability to properly identify positive samples, Precision = TP/(TP + FP).
Negative predictive value (NPV): the ability to properly identify negative samples, NPV = TN/(TN + FN).
Accuracy: the fraction of all predictions correctly classified, Accuracy = (TP + TN)/(TP + TN + FP + FN).
F1-score: the harmonic mean between precision and recall, F1 = 2 · (Precision · Recall)/(Precision + Recall).
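The indexes above can be computed directly from the confusion matrix, as in this sketch (class “1” is taken as the positive class; the function and variable names are illustrative).

```python
from sklearn.metrics import confusion_matrix

def verification_metrics(y_true, y_pred):
    """Compute the performance indexes defined above from the binary confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    recall      = tp / (tp + fn)               # TPR / sensitivity
    specificity = tn / (tn + fp)               # TNR
    precision   = tp / (tp + fp)
    npv         = tn / (tn + fn)
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    f1          = 2 * precision * recall / (precision + recall)
    return dict(recall=recall, specificity=specificity, precision=precision,
                npv=npv, accuracy=accuracy, f1=f1)
```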
5. Discussion
In this section, we discuss the results obtained and compare them with state-of-the-art PCG-based human identity verification systems. It must be noted that, to perform a fair comparison, it is mandatory to consider studies using the same dataset and biometric method as it would be unreliable to compare studies using different approaches.
Performance figures evaluated on a dataset consisting of PCG recordings from a high number of subjects are statistically more significant. At this time, HSCT-11 contains PCG recordings from 206 subjects, and, in terms of the number of subjects, it exceeds all other similar datasets [
30]. To the best of our knowledge, those papers using HSCT-11 to build and evaluate a human identity verification system based on PCG recordings include [
34,
36]. The first observation in comparison with the latter studies is that our approach introduced a neural network model to perform human identity verification based on PCG biometrics. Furthermore, this study analyzed the performance of the verification approach in the presence of environmental noise (obtaining an accuracy ranging from 89.10% to 100% when varying the type of noise and observation window).
Comparing the values of the performance parameters, Ref. [
34] achieved an EER = 13.66% by means of a statistical approach based on GMM as the classification model and a feature array consisting of 16 LFCCs + ΔLFCCs + E + ΔE (i.e., static and first-derivative cepstral coefficients plus the frame energy and its first derivative).
The authors of [
36] proposed a different set of features with the aim to increase the performance of the human identity verification system proposed in [
34]. They obtained an EER = 11.16% using modified Mel-Frequency Cepstral Coefficients (M-MFCCs) and an EER = 8.73% using Wavelet Packet Decomposition (WPD) denoted as Wavelet Packet Cepstral Coefficients (WPCCs).
By means of the use of 50 MFCCs and our approach based on array differences and MLP, we obtained an EER = 5%, which is significantly lower than the values obtained in previous works. This last result was obtained by analyzing a set of signals lasting only 2 s. We proved that the error rate can be further reduced, obtaining accuracy, precision, and recall values almost equal to 100% (i.e., an EER approaching 0%), by increasing the length of the “recurrence filter” applied at the output of the MLP (and, accordingly, analyzing longer sets of signals lasting from 2 to 22 s). Finally, we demonstrated the robustness of our system to additive environmental noise, which has not been considered in similar previous works.
Although it does not use the HSCT-11 dataset and the approach used to acquire the signal is significantly different from ours (it uses sounds captured at the ear level by in-ear microphones), we propose a qualitative comparison with [
51], a recent paper that addressed human identity verification by means of PCG. In that paper, the performance was evaluated by examining the false accepted rate (FAR) and the false rejected rate (FRR). A 5-fold cross-validation experiment (20% training and 80% testing) was conducted in a static scenario involving 45 participants. Overall, in clean conditions, the authors reported average FAR and FRR values of 1.6% and 1.8%, respectively. These results appear to be better than ours, but one should take into account that the duration of the recordings was set in the range from 30 s to 5 min, significantly higher than our original 2 s. Moreover, the authors claimed a noticeable decline in their system’s overall performance in the presence of ambient noise, while our system appears to be robust to additive environment noise.
In general, comparing our approach to the HeartPrint system [
51], several notable advantages can be observed:
The simplicity of the proposed system is evident in the use of a compact feature set (only MFCCs) and the absence of signal alignment requirements, thus facilitating the creation of more economical and energy-efficient devices.
The performance is comparable (and even better) while analyzing shorter audio signals, resulting in faster authentication and reduced battery drain for devices not connected to power sources.
The robustness to ambient noise at very low SNRs enables effective use in crowded environments such as stations and airports.