**2. Proposed Methodology**

Assume a fingerprint dataset $D = \{X_i, y_i\}_{i=1}^{N}$ composed of $N = A + B$ samples (where $A$ is the number of artefact samples and $B$ is the number of bona fide samples), where $X_i$ represents the input fingerprint image and $y_i$ is a binary label indicating whether the fingerprint is an artefact or bona fide (real). The aim of ordinary fingerprint PAD is to detect whether a fingerprint image is a PA (artefact) and differentiate it from a bona fide fingerprint sample. In this study, we consider ECG signals as an additional input modality to strengthen the fingerprint PAD system. To this end, the dataset becomes the triplet $D = \{X_i^f, X_i^e, y_i\}_{i=1}^{N}$, where $X_i^f$ is the fingerprint image and $X_i^e$ is the ECG signal.
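As an illustration, the triplet dataset above can be represented by a simple container; a minimal Python sketch in which the field names, array shapes, and the label convention (1 = bona fide) are our own illustrative assumptions:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class PADSample:
    """One triplet (X_f, X_e, y) from the dataset D."""
    X_f: np.ndarray  # fingerprint image, e.g. an (H, W) grayscale array
    X_e: np.ndarray  # one-dimensional ECG signal
    y: int           # assumed convention: 1 = bona fide, 0 = artefact (PA)


# A tiny synthetic dataset with A artefacts and B bona fides, so N = A + B.
A, B = 2, 3
D = (
    [PADSample(np.zeros((8, 8)), np.zeros(16), 0) for _ in range(A)]
    + [PADSample(np.ones((8, 8)), np.ones(16), 1) for _ in range(B)]
)
N = len(D)
```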

Figure 2 shows the proposed fusion approach, which is composed of three parts, i.e., the fingerprint branch, the ECG branch, and a fusion module. Detailed descriptions for these branches are provided in the next subsections.

**Figure 2.** Overall architecture of the proposed end-to-end convolutional neural network-based (CNN) fusion architecture. ECG, electrocardiogram.

## *2.1. Fingerprint Branch*

The fingerprint branch uses state-of-the-art EfficientNets [53] to obtain the feature representations of a fingerprint, as shown in Figure 3. EfficientNets are a family of models recently developed by the Google Brain team by applying a new model scaling method that balances the depth, width, and resolution of CNNs [53]. Their scaling method uniformly scales the dimensions of a network using a simple and effective compound coefficient. The compound scaling method enables a baseline CNN to be scaled up with respect to the available resources while maintaining high efficiency and accuracy. EfficientNets use the mobile inverted bottleneck convolution (MBConv) as their basic building block [54]. In addition, the network uses an attention mechanism based on squeeze-and-excitation (SE) to improve the feature representations. This attention layer starts by applying global average pooling (GAP) after each block. This operation is then followed by a fully-connected layer (with weights $W_1$) that reduces the number of channels to 1/16 of the original. The resulting feature vector $s$ is then used to recalibrate the feature maps of each channel ($V$) via a channel-wise scaling operation, after an additional fully-connected layer with weights $W_2$. SE operates as shown below:

$$s = \mathrm{Sigmoid}(W_2(\mathrm{ReLU}(W_1(V)))),\tag{1}$$

$$V_{SE} = s \odot V,\tag{2}$$

where $s$ is the scaling factor, $\odot$ refers to channel-wise multiplication, and $V$ represents the feature maps obtained from a particular layer of the EfficientNet.
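The SE recalibration of Equations (1) and (2) can be sketched as follows; a minimal NumPy illustration in which the reduction ratio of 16 follows the text, while the feature-map size and the random weights are our own placeholder assumptions, not trained EfficientNet parameters:

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def squeeze_excite(V, W1, W2):
    """Squeeze-and-excitation recalibration of feature maps V.

    V  : (H, W, C) feature maps from one block
    W1 : (C, C // 16) reduction weights (squeeze to 1/16 of the channels)
    W2 : (C // 16, C) expansion weights
    """
    # Squeeze: global average pooling over the spatial dimensions.
    z = V.mean(axis=(0, 1))                       # shape (C,)
    # Excite: FC -> ReLU -> FC -> Sigmoid yields per-channel scales s.
    s = sigmoid(np.maximum(z @ W1, 0.0) @ W2)     # shape (C,), values in (0, 1)
    # Recalibrate: channel-wise scaling of the original feature maps.
    return s * V                                  # broadcasts over (H, W, C)


rng = np.random.default_rng(0)
C = 32
V = rng.standard_normal((7, 7, C))
W1 = rng.standard_normal((C, C // 16))
W2 = rng.standard_normal((C // 16, C))
V_se = squeeze_excite(V, W1, W2)
```

Because the sigmoid gates lie in (0, 1), the recalibrated maps never exceed the original activations in magnitude; channels the gate deems uninformative are suppressed.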

**Figure 3.** Flowchart of a fingerprint branch.

Furthermore, EfficientNets use a novel activation function called Swish, which is essentially the input $x$ multiplied by its sigmoid, according to Equation (3):

$$f(x) = x \cdot \mathrm{Sigmoid}(x).\tag{3}$$

Figure 4 shows the behavior of the Swish activation function:

**Figure 4.** Swish activation function [55].
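The Swish function of Equation (3) is straightforward to compute; a minimal NumPy sketch:

```python
import numpy as np


def swish(x):
    """Swish activation: x scaled by its own sigmoid, x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


# Swish is near-linear for large positive inputs, zero at the origin,
# and slightly negative (non-monotonic) for negative inputs.
x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
y = swish(x)
```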

EfficientNet models surpass the accuracy of state-of-the-art CNN approaches on the ImageNet dataset [56] while requiring fewer parameters and FLOPS, as shown in Figure 5. In this study, we investigate the baseline EfficientNet-B3 for the feature representation of fingerprints. To the best of our knowledge, this is the first time EfficientNets have been used for fingerprint PAD.

**Figure 5.** Comparison among EfficientNet and other popular CNN models in terms of ImageNet accuracy vs. model size [53].

During the experiments, we truncated EfficientNet-B3 by removing its 1000-way softmax classification layer and used the output of the "swish\_78" layer as the input to the fusion module, which fuses the fingerprint and ECG features, as described later.
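The truncation and feature handoff described above can be mimicked schematically; in this NumPy sketch the embedding sizes, the random-projection stand-ins for the two branches, and the simple concatenation are all illustrative assumptions (the actual fusion module is described later):

```python
import numpy as np


def fingerprint_features(x_f, feat_dim=1536):
    """Stand-in for the truncated EfficientNet-B3 branch: in the real
    model, the softmax head is removed and the "swish_78" activation
    output serves as the fingerprint embedding. Here, a fixed random
    projection acts as a placeholder feature extractor."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((x_f.size, feat_dim))
    return x_f.ravel() @ W


def ecg_features(x_e, feat_dim=128):
    """Placeholder for the ECG-branch embedding."""
    rng = np.random.default_rng(1)
    W = rng.standard_normal((x_e.size, feat_dim))
    return x_e.ravel() @ W


# The fusion module receives the fingerprint and ECG embeddings together.
x_f = np.zeros((64, 64))   # fingerprint image
x_e = np.zeros(256)        # ECG signal
fused = np.concatenate([fingerprint_features(x_f), ecg_features(x_e)])
```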
