*2.4. Network Optimization*

As mentioned previously, the complete architecture proposed in this study is trained end to end using the backpropagation algorithm. If we denote the output of the sigmoid function in the final layer of the network by $\hat{y}_i$, then $\hat{y}_i$ defines a Bernoulli distribution over the binary label $y_i$. The weights $W$ of the network, including those of the fingerprint and ECG branches, can be determined by maximizing the following likelihood function:

$$L(D, W) = \prod_{i=1}^{N} \hat{y}_i^{\,y_i} \, (1 - \hat{y}_i)^{1 - y_i}, \tag{4}$$

which is equivalent to minimizing the following negative log-likelihood function:

$$L(D, W) = -\sum_{i=1}^{N} \left[ y_i \ln(\hat{y}_i) + (1 - y_i) \ln(1 - \hat{y}_i) \right]. \tag{5}$$
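For concreteness, a minimal NumPy sketch of this loss is given below; the function name, the example values, and the `eps` clipping safeguard (which avoids $\ln(0)$) are illustrative additions, not part of Eq. (5) itself.

```python
import numpy as np

def binary_cross_entropy(y_true, y_hat, eps=1e-12):
    """Negative log-likelihood of Eq. (5) for Bernoulli outputs.

    y_true : labels in {0, 1}
    y_hat  : sigmoid outputs of the final layer, in (0, 1)
    eps    : numerical safeguard against log(0), not part of Eq. (5)
    """
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.sum(y_true * np.log(y_hat)
                   + (1.0 - y_true) * np.log(1.0 - y_hat))

# Example with three genuine/impostor pairs (illustrative values):
labels = np.array([1.0, 0.0, 1.0])
scores = np.array([0.9, 0.2, 0.7])
print(binary_cross_entropy(labels, scores))  # ≈ 0.685
```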

The loss function in (5) is usually called a cross-entropy loss function. To optimize this loss, we use the RMSProp optimization algorithm proposed by Hinton [57], one of the most widely used adaptive gradient methods, which divides the gradient by a moving average of its recent magnitude as follows:

$$E\left[g^2\right]_t = \beta \, E\left[g^2\right]_{t-1} + (1-\beta)\left(\frac{\partial L}{\partial W}\right)^2, \tag{6}$$

$$W_t = W_{t-1} - \frac{\alpha}{\sqrt{E\left[g^2\right]_t}} \, \frac{\partial L}{\partial W}, \tag{7}$$

where $E[g^2]_t$ represents a moving average of the squared gradients at iteration $t$, and $\partial L / \partial W$ is the gradient of the loss function with respect to the network weights $W$. The parameters $\alpha$ and $\beta$ are the learning rate and the moving-average (decay) coefficient, respectively. During the experiments, $\beta$ is set to its default value ($\beta = 0.9$), whereas $\alpha$ is initially set to 0.0001 and is decreased by a factor of 10 every 20 epochs.
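The sketch below, assuming a NumPy setting, shows one update step combining Eqs. (6) and (7) together with the step-wise learning-rate schedule described above; the function names and the small `eps` stability constant (common in practice but not written in Eq. (7)) are our illustrative additions.

```python
import numpy as np

def rmsprop_step(w, grad, avg_sq, alpha, beta=0.9, eps=1e-8):
    """One RMSProp update following Eqs. (6)-(7).

    w      : current weights W_{t-1}
    grad   : gradient dL/dW at the current step
    avg_sq : running average E[g^2]_{t-1} from the previous step
    eps    : stability constant (an illustrative addition)
    """
    avg_sq = beta * avg_sq + (1.0 - beta) * grad ** 2   # Eq. (6)
    w = w - alpha * grad / np.sqrt(avg_sq + eps)        # Eq. (7)
    return w, avg_sq

def learning_rate(epoch, alpha0=1e-4):
    """Schedule from the text: alpha starts at 1e-4 and is
    divided by 10 every 20 epochs."""
    return alpha0 * (0.1 ** (epoch // 20))
```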
