*2.2. ECG Branch*

Regarding an ECG branch, we propose three different feature representation architectures, FC, 1D-CNN, and 2D-CNN, as shown in Figure 6. The FC architecture is composed of simple fully-connected layers followed by batch normalization (BN), a Swish activation function, and dropout regularization to reduce an overfitting, as presented in Figure 6a. The second architecture, 1D-CNN, is based on the application of 1D convolution operations on the ECG signals. With this architecture, the ECG signals are fed through three 1D consecutive convolutional layers, the first two layers of which are followed by a BN, Swish, dropout (0.25), and 1D average pooling, and the last layer is followed by a BN, Swish, and 1D average pooling, as shown in Figure 6b. The last architecture, which is one of the main contributions of the present paper, is based on the idea of converting ECG feature into a 2D features image using a generator module and then processing the signal using a standard 2D CNN network, as shown in Figure 6c. In particular, this architecture learns and reshapes a 1D ECG feature into a 2D image using fully connected layers. The resulting image is then fed to two consecutives MBConv blocks to obtain the final 2D representations. Transforming a 1D ECG feature into an image can play a significant role in achieving powerful 2D convolution and pooling operations when learning the appropriate ECG features. During the experiments, we show that this architecture allows the generation of a better representation compared to those based on an analysis of 1D features.

**Figure 6.** Details of ECG feature extraction architectures in ECG branch: (**a**) FC (**b**) 1D-CNN, and (**c**) 2D-CNN.
