## **2. Materials**

#### *2.1. MIT-BIH Database*

The MIT-BIH database contains 48 sets of two-lead ECG signals (lead II and mostly V1). Each signal is approximately 30 min long, has been collected at a 360 Hz sampling frequency, and has been independently annotated by at least two cardiologists. Annotations include the 15 types listed in Table 1. In our study, we use the signals for which both II and V1 leads are available (see PhysioBank for further details).

#### *2.2. INCART Database*

The St. Petersburg Institute of Cardiological Techniques 12-lead arrhythmia database (INCART) contains 75 sets of 12-lead ECG signals (leads I, II, III, aVR, aVL, aVF, V1, V2, V3, V4, V5, V6). Each signal is approximately 30 min long and has been collected at a 257 Hz sampling frequency. We only consider leads II and V1 of each record. Annotations include the 7 types listed in Table 2.


**Table 1.** Details of the categories for MIT-BIH database.

**Table 2.** Details of the categories for INCART database.


## **3. Proposed Method**

The input of the proposed method is a two-channel ECG segment obtained after a preliminary segmentation step that isolates the heartbeats in each record. Given the R-peak positions, each heartbeat is isolated by retaining *T*1 samples to the left and *T*2 samples to the right of the R-peak. For each of the two leads, a vector denoted as *xh* (*h* = 1, 2) of size (*T*1 + *T*2) is built with the values of the ECG (in Volts). The values of (*T*1, *T*2) are (160, 200) for MIT-BIH and (120, 136) for INCART. These values are given in Table 3, and are similar to those used in [20].

**Table 3.** Sampling rates, values of *T*1 and *T*2 for both MIT-BIH and INCART databases.
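As a minimal illustration of this segmentation step, the sketch below isolates fixed-length beats around given R-peak positions (`isolate_heartbeats` is a hypothetical helper name; R-peak detection itself is assumed to be done upstream):

```python
import numpy as np

def isolate_heartbeats(signal, r_peaks, t1, t2):
    """Cut a 1-D ECG lead into fixed-length heartbeats around each R-peak.

    Each beat keeps t1 samples to the left and t2 samples to the right
    of the R-peak, giving vectors of size t1 + t2, as described in the text.
    """
    beats = []
    for r in r_peaks:
        # Skip beats whose window would fall outside the record (our choice;
        # the text does not specify how border beats are handled).
        if r - t1 >= 0 and r + t2 <= len(signal):
            beats.append(signal[r - t1:r + t2])
    return np.array(beats)
```

For MIT-BIH, `t1=160, t2=200` would produce 360-sample beats; for INCART, `t1=120, t2=136` produces 256-sample beats.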


The architecture of the whole process is shown in Figure 1a,b. The first stage is feature extraction. The input of the process is, for each lead, a vector *xh* (with *h* = 1, 2) containing the raw values of the segmented heartbeat. Each lead *xh* (*h* = 1, 2) is normalized (see the "Normalization" module in Figure 1a) by a rescaling procedure so that the resulting vector *xh*,norm has an intensity that ranges from 0 to 1, as per Equation (1):

$$x\_{h,\text{norm}} = \frac{x\_{h} - \min(x\_{h})}{\max(x\_{h}) - \min(x\_{h})}.\tag{1}$$
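Equation (1) is a standard min-max rescaling; a minimal numpy sketch (the function name is ours):

```python
import numpy as np

def minmax_normalize(x):
    """Rescale a segmented heartbeat to the [0, 1] range, as in Equation (1)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())
```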

At the same time, hand-crafted features are extracted from each lead *xh* (*h* = 1, 2): frequency-domain features *xh*,spec and autoregression features *xh*,ar. A single time-domain feature vector *x*time is also computed for both leads. The frequency-domain features, the autoregression features and the normalized segmented heartbeat *xh*,norm are concatenated to obtain the vector *xh*,cat = [*xh*,ar *xh*,spec *xh*,norm] (see Figure 1a). The vectors *xh*,cat (*h* = 1, 2) are processed by the neural module to produce a single output vector *f*neur for the two leads (see Figure 1b). The dimensionality of *f*neur is then reduced with Principal Component Analysis (PCA), thus obtaining the vector *f*pca. The concatenation of the time-domain features *x*time and *f*pca is the input of a Support Vector Machine classifier, whose output is the predicted heartbeat class. In the following subsections, the feature extraction and the neural module are discussed in more detail.

**Figure 1.** Classification processing pipeline of our method. (**a**) Feature extraction: for each of the two leads *xh* (*h* = 1, 2), hand-crafted features are extracted to build *xh*,cat (*h* = 1, 2), while *x*time, a vector of time-domain features, is built for both leads. (**b**) Part of these features, *xh*,cat (*h* = 1, 2), feeds the 1-D CCANet-SVD module, which first outputs the neural features *f*neur and then a reduced version of the neural features, *f*pca. The concatenation of *x*time and *f*pca feeds the classification module. The output is the predicted heartbeat class.

#### *3.1. Hand-Crafted Feature Extraction*

Given an isolated heartbeat *xh* (*h* = 1, 2), hand-crafted features are extracted with three different methods: frequency-domain analysis, time-domain analysis, and autoregressive modeling.

#### 3.1.1. One-Dimensional Spectrogram

For the frequency domain, we use a one-dimensional spectrogram, a representation of the frequency spectrum of a signal as it varies with time. It is built through a short-time Fourier transform (STFT) of each of the two non-normalized leads *xh* (*h* = 1, 2). A Hamming window slides through the signal (with possible overlap) and, at each step, the squared magnitudes of the STFT of the windowed portion of the signal are computed. The spectrogram is then obtained by concatenating, along the time axis, the squared magnitudes acquired at each window position. The squared magnitudes obtained for each frequency (up to half of the sampling rate) at each time step are stored in a matrix whose axes 0 and 1 are the frequency and time axes, respectively. Since the range of the squared magnitudes varies significantly, the resulting matrix is rescaled to [0, 1] to yield *Xh*,spec (*h* = 1, 2).

A weighted average along the frequency axis is then performed, turning the *Xh*,spec matrix into a one-dimensional vector *xh*,spec, as per Equation (2):

$$x\_{h,\text{spec}}[t] = \frac{1}{S(n)} \sum\_{k} \frac{1}{k^{n}} X\_{h,\text{spec}}[k, t], \tag{2}$$

where $S(n) = \sum\_{k} \frac{1}{k^{n}}$.

This feature extraction method requires three parameters: the number of samples in the STFT window (*N*wind = 64 for MIT-BIH and 46 for INCART), the number of overlapping samples between two consecutive steps (*N*overlap = 32 for MIT-BIH and 23 for INCART), and the weight parameter *n* of Equation (2) (0.25 for both databases). Suitable parameters are found with a greedy search. The feature vector *xh*,spec is of size 10.
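A possible numpy/scipy sketch of this feature, assuming `scipy.signal.spectrogram`'s default power scaling is an acceptable stand-in for the squared STFT magnitudes (the function and argument names are ours, not the authors'):

```python
import numpy as np
from scipy.signal import spectrogram

def spectrogram_feature(x, fs, n_wind=64, n_overlap=32, n=0.25):
    """Weighted frequency-average of a rescaled spectrogram (Section 3.1.1 sketch)."""
    # Squared STFT magnitudes with a Hamming window.
    _, _, sxx = spectrogram(x, fs=fs, window='hamming',
                            nperseg=n_wind, noverlap=n_overlap)
    # Rescale the whole matrix to [0, 1].
    sxx = (sxx - sxx.min()) / (sxx.max() - sxx.min())
    # Weighted average along the frequency axis with weights 1/k^n
    # (we start the frequency index k at 1 to avoid division by zero).
    k = np.arange(1, sxx.shape[0] + 1)
    w = 1.0 / k ** n
    return (w[:, None] * sxx).sum(axis=0) / w.sum()
```

With the MIT-BIH parameters (360-sample beats, `n_wind=64`, `n_overlap=32`) this yields the 10-element vector mentioned in the text.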

#### 3.1.2. Autoregressive Modeling

Autoregressive (AR) modeling specifies that a time series value depends linearly on its own previous values and a stochastic term, as per Equation (3):

$$\mathbf{X}\_{t} = \sum\_{i=1}^{p} \varphi\_{i} \mathbf{X}\_{t-i} + \epsilon(t), \tag{3}$$

where **X***t* is the time series, *ϕi* are the AR coefficients, computed with the Yule-Walker method, and *p* is the order of the AR model. Since the choice of the order *p* depends strongly on the sampling rate, the non-normalized ECGs from both databases are resampled to 360 Hz [38]. The order was then chosen by performing a parameter search on the training data of both the MIT-BIH and INCART databases: we chose the order that maximized the average of our performance metrics (accuracy, specificity, sensitivity, positive predictive value) on a validation set. Figure 2 shows that the best order is 2 for the INCART data and 3 for the MIT-BIH data. Since the performance for MIT-BIH is quite comparable for orders 2 and 3, we chose an order equal to 2 for both datasets, preferring the lower order to reduce the computational cost. The vector of AR coefficients obtained for each lead of one heartbeat, *xh* (*h* = 1, 2), is denoted as *xh*,ar and is of size 2.

**Figure 2.** Mean performance for different AR orders on validation sets for MIT-BIH signals (red) and INCART signals (blue).
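The Yule-Walker estimation can be sketched as follows (our own numpy/scipy implementation, not the authors' code):

```python
import numpy as np
from scipy.linalg import solve, toeplitz

def ar_coefficients(x, order=2):
    """Estimate AR coefficients by solving the Yule-Walker equations
    (the paper settles on order p = 2 for both databases)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    # Biased autocovariance estimates r[0..order].
    r = np.array([x[:n - k] @ x[k:] / n for k in range(order + 1)])
    # Solve the Toeplitz system R @ phi = r[1:] for the coefficients.
    return solve(toeplitz(r[:order]), r[1:])  # the x_{h,ar} vector
```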

#### 3.1.3. Time-Domain Features

For each of the two leads *xh* (*h* = 1, 2) of one segmented heartbeat, we compute the following time-domain features: the median value of *xh*, its fourth- and fifth-order central moments, and its kurtosis. Finally, for both leads, we build a single vector of time-domain features comprising these features for each lead plus the heart rate of the patient to whom the heartbeat belongs. The resulting vector is denoted as *x*time and is of size 9.
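A sketch of the *x*time construction, assuming the `scipy.stats` definitions of central moment and kurtosis match those used in the paper (the function name is ours):

```python
import numpy as np
from scipy.stats import kurtosis, moment

def time_features(lead1, lead2, heart_rate):
    """Build the 9-dimensional x_time vector of Section 3.1.3: median,
    4th and 5th central moments and kurtosis of each lead, plus the
    patient's heart rate."""
    feats = []
    for x in (lead1, lead2):
        feats += [np.median(x), moment(x, 4), moment(x, 5), kurtosis(x)]
    feats.append(heart_rate)
    return np.array(feats)  # 4 features x 2 leads + heart rate = 9
```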

#### *3.2. Neural Feature Extraction*

To exploit the correlation between two ECG leads, we use a one-dimensional variant of the canonical correlation analysis network (CCANet). First introduced in the field of image recognition by Yang et al. [33], CCANet has been employed in two-view image recognition tasks. Recently, CCANets, which are intrinsically two-dimensional, have been successfully employed in the signal processing field for the classification of two and three lead heartbeats [20]. A CCANet is usually composed of two cascaded convolutional layers and an output layer: (1) in the convolutional layers, the CCA technique is used to extract dual-lead filter banks; (2) in the output layer, the features extracted from the second convolutional layer are mapped into the final feature vector [20].

In this paper, with the aim of increasing performance, we design a new 1-D canonical correlation analysis network composed of four 1-D convolutional layers and an output layer. Unlike CCANet, the filters are found by combining a CCA with a singular value decomposition (SVD), and features are extracted after each layer. The use of 1-D convolutions instead of 2-D limits the computational cost, which allows the number of layers to be increased from two to four and, consequently, performance to be improved.

The processing pipeline is shown in Figure 3. The input of the proposed 1-D CCANet-SVD is the concatenation of the autoregressive features, the spectrogram features, and the original normalized heartbeat, resulting in the vector *xh*,cat = [*xh*,ar *xh*,spec *xh*,norm] ∈ R*m*, *h* = 1, 2. The 1-D CCANet-SVD is trained with *N* two-lead heartbeats and then used as a neural feature extractor in combination with a linear SVM for heartbeat classification. The network is trained separately for the MIT-BIH and INCART databases.

**Figure 3.** Proposed 1-D CCANet-SVD.

#### 3.2.1. First Convolutional Layer

We denote by $x_{h,cat}^{(i)}$ the $i$-th element ($i \in \{1, \ldots, m\}$) of an input vector $x_{h,cat}$. A segment of size $k$ is selected around each value $x_{h,cat}^{(i)}$, yielding the $m$ segments $b_{h,1}, \ldots, b_{h,m} \in \mathbb{R}^k$. These segments are then zero-centered and concatenated to build the matrix $[b_{h,1}, \ldots, b_{h,m}] \in \mathbb{R}^{k \times m}$. This procedure is performed on each of the $N$ training heartbeats, and the resulting matrices of segments are finally concatenated to obtain $\mathbf{X}_h \in \mathbb{R}^{k \times Nm}$, $h = 1, 2$. Note that our network is fed with all the training heartbeats simultaneously in order to build the two matrices $\mathbf{X}_1$ and $\mathbf{X}_2$.
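The segment-matrix construction can be sketched as follows (our own numpy sketch; zero-padding at the borders is an assumption, since the text does not specify how the segments centered near the edges are handled):

```python
import numpy as np

def patch_matrix(beats, k):
    """Build the matrix X_h of zero-centered, length-k segments centered on
    each sample of each input vector. `beats` is an (N, m) array of x_{h,cat}
    vectors; the result has shape (k, N*m)."""
    n, m = beats.shape
    half = k // 2
    padded = np.pad(beats, ((0, 0), (half, half)))  # zero-pad the borders
    cols = []
    for beat in padded:
        for i in range(m):
            seg = beat[i:i + k]
            cols.append(seg - seg.mean())  # zero-center each segment
    return np.stack(cols, axis=1)
```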

Let us address the filter extraction stage. In [20], the filters are found with a CCA, thus by maximizing the correlation between pairs of projected variables. The first projection direction can be obtained by optimizing Equation (4):

$$\max \rho(a\_1, b\_1) = a\_1^T S\_{12} b\_1 \tag{4}$$

with the constraints $a_1^T S_{11} a_1 = 1$ and $b_1^T S_{22} b_1 = 1$, where $S_{hh} = \mathbf{X}_h \mathbf{X}_h^T$, and $a_1$ and $b_1$ are the first canonical vectors for each of the two leads. The Lagrange multiplier technique shows that $a_1$ and $b_1$ are eigenvectors of $M_1 = S_{11}^{-1} S_{12} S_{22}^{-1} S_{21}$ and $M_2 = S_{22}^{-1} S_{21} S_{11}^{-1} S_{12}$, respectively. Given the first $l - 1$ directions, the $l$-th projection direction can be calculated by solving problem (4) with the additional constraints $a_i^T S_{11} a_l = b_i^T S_{22} b_l = 0$ ($i < l$). In the end, the $L_1$ filters for the first lead are built by taking the $L_1$ primary eigenvectors of $M_1$ (i.e., those associated with the $L_1$ biggest eigenvalues), whereas the $L_1$ filters for the second lead are built by considering the $L_1$ primary eigenvectors of $M_2$.

In this paper, we use a slightly different approach, referred to as the CCA-SVD filter extraction technique. We perform an SVD of both $M_1$ and $M_2$, as per $M_1 = U_1 D_1 V_1^T$ and $M_2 = U_2 D_2 V_2^T$, where the $U$ and $V$ matrices are unitary and the $D$ matrices are diagonal, with the singular values on the diagonals. The SVD retrieves the directions that best explain the variance of $M_1$ and $M_2$. Since these two matrices derive from the CCA, they capture the correlations between the two leads; the directions found by the SVD therefore give the best explanation of the correlation between the two leads. Consequently, the $L_1$ filters for the first lead are built by taking the columns of $U_1$ associated with the $L_1$ biggest singular values of $D_1$, whereas the $L_1$ filters for the second lead are built by considering the columns of $U_2$ associated with the $L_1$ biggest singular values of $D_2$. This approach yields better results than the traditional CCA filter extraction technique (see *Experiments*). We denote by $W_{1,l}$ and $W_{2,l}$, $l = 1, \ldots, L_1$, the $L_1$ filters of size $k$ corresponding to the first and second lead, respectively.
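A compact numpy sketch of the CCA-SVD filter extraction (the small ridge term added for numerical stability is our addition, not part of the paper's method):

```python
import numpy as np

def cca_svd_filters(x1, x2, n_filters):
    """CCA-SVD filter extraction (Section 3.2.1 sketch): build M1, M2 from
    the segment matrices X_1, X_2 of shape (k, N*m), then take the left
    singular vectors with the largest singular values as filters."""
    s11, s22 = x1 @ x1.T, x2 @ x2.T
    s12 = x1 @ x2.T
    s21 = s12.T
    # Ridge term so that s11 and s22 are safely invertible (our addition).
    eps = 1e-8 * np.eye(s11.shape[0])
    m1 = np.linalg.solve(s11 + eps, s12) @ np.linalg.solve(s22 + eps, s21)
    m2 = np.linalg.solve(s22 + eps, s21) @ np.linalg.solve(s11 + eps, s12)
    filters = []
    for m_ in (m1, m2):
        u, _, _ = np.linalg.svd(m_)         # singular values come sorted
        filters.append(u[:, :n_filters].T)  # L filters of size k per lead
    return filters  # [W_1, W_2], each of shape (n_filters, k)
```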

As for the convolutions, for each lead *h*, each input signal *xh*,cat yields *L*1 outputs *xh*,cat,*l* = *xh*,cat ∗ *Wh*,*l*, *l* = 1, ... , *L*1. The lengths of the input and output signals are kept identical thanks to zero-padding of the input.

#### 3.2.2. First Extraction Stage

The extraction stage follows the same steps as in [20]. First, for each heartbeat, the output of the first convolution is converted to a decimal one-dimensional signal as per $T = \sum_{l=1}^{L_1} 2^{l-1} H([x_{1,cat,l}, x_{2,cat,l}]) \in \mathbb{R}^{2m}$, where $H$ is the Heaviside step function. The range of each component of $T$ is therefore $[0, 2^{L_1} - 1]$. $T$ is then divided into $B$ blocks of size $u_1$. Each block can overlap with its neighbor according to $R_1 \in [0, 1]$, an overlapping proportion parameter. For each of these blocks, a histogram with $2^{L_1}$ bins is built. The values of the histogram for each block are embedded in a $2^{L_1}$-long vector, and the vectors provided by all blocks are then concatenated to obtain $B_{hist}(T) \in \mathbb{R}^{2^{L_1} B}$. The first feature vector for the heartbeat is $f^1 = B_{hist}(T)$.
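The hashing-and-histogram stage can be sketched as follows (our own numpy sketch; we take $H(x) = 1$ for $x > 0$ and derive the block step from the overlap proportion, both assumptions where the text leaves details open):

```python
import numpy as np

def binary_hash_histogram(outputs1, outputs2, block, overlap):
    """First extraction stage (sketch): binarize the L1 convolution outputs
    of the two leads, pack them into a decimal code T, then concatenate
    per-block histograms with 2^L1 bins. `outputs1`/`outputs2` are (L1, m)
    arrays; `overlap` plays the role of R1 in [0, 1]."""
    l1 = outputs1.shape[0]
    stacked = np.concatenate([outputs1, outputs2], axis=1)  # (L1, 2m)
    # Heaviside step (H(x) = 1 if x > 0), channel l weighted by 2^(l-1).
    weights = 2 ** np.arange(l1)
    t = (weights[:, None] * (stacked > 0)).sum(axis=0)  # values in [0, 2^L1 - 1]
    step = max(1, int(block * (1 - overlap)))
    hists = []
    for start in range(0, len(t) - block + 1, step):
        hist, _ = np.histogram(t[start:start + block],
                               bins=2 ** l1, range=(0, 2 ** l1))
        hists.append(hist)
    return np.concatenate(hists)  # the feature vector B_hist(T)
```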

#### 3.2.3. Second Convolution Layer and Extraction Stage

The second layer is identical to the first one, except that its input is different. Before the first convolution, each lead of a heartbeat was represented by a single vector of length *m*; after it, each lead is represented by *L*1 vectors of length *m*. Let us walk through the second layer with the notations used so far.

The signals $x_{h,cat,l} = x_{h,cat} * W_{h,l}$, $l = 1, \ldots, L_1$, produced by the first convolutional layer are the input of the second layer. Since we initially considered $N$ training heartbeats, this layer has a total of $N \times L_1$ input vectors corresponding to lead 1 and $N \times L_1$ input vectors corresponding to lead 2. The same segmentation and zero-centering process as in the first layer gives $\mathbf{Y}_h \in \mathbb{R}^{k \times mNL_1}$ ($h = 1, 2$), the matrices of the concatenated segments of all the input vectors, for each lead.

Applying the CCA-SVD filter extraction technique with $\tilde{S}_{hh} = \mathbf{Y}_h \mathbf{Y}_h^T$ leads us to perform the SVD of $\tilde{M}_1 = \tilde{S}_{11}^{-1} \tilde{S}_{12} \tilde{S}_{22}^{-1} \tilde{S}_{21}$ and $\tilde{M}_2 = \tilde{S}_{22}^{-1} \tilde{S}_{21} \tilde{S}_{11}^{-1} \tilde{S}_{12}$, for the first and second lead, respectively. The filters are then found exactly as in the first convolutional layer, and we denote by $\tilde{W}_{1,\ell}$ and $\tilde{W}_{2,\ell}$, $\ell = 1, \ldots, L_2$, the $L_2$ filters of size $k$ extracted for the first and second lead, respectively.

As for the convolutions, for each initial lead $h = 1, 2$ and channel $l \in \{1, \ldots, L_1\}$, the signal $x_{h,cat,l}$ yields $L_2$ outputs $x_{h,cat,l,\ell} = x_{h,cat,l} * \tilde{W}_{h,\ell}$, $\ell = 1, \ldots, L_2$. At this stage, each initial lead of a heartbeat is represented by $L_1 \times L_2$ vectors of size $m$.

The second extraction step is the same as after the first convolutional layer, except for a few points. First, for each heartbeat, the output of the second convolutional layer is converted to a decimal signal as per $\tilde{T}_l = \sum_{\ell=1}^{L_2} 2^{\ell-1} H([x_{1,cat,l,\ell}, x_{2,cat,l,\ell}]) \in \mathbb{R}^{2m}$, $l \in \{1, \ldots, L_1\}$. The second feature vector for the heartbeat is then obtained as $f^2 = [B_{hist}(\tilde{T}_1), B_{hist}(\tilde{T}_2), \ldots, B_{hist}(\tilde{T}_{L_1})]$. The $B_{hist}$ are built with a block size and an overlapping parameter equal to $u_2$ and $R_2$, respectively.

The third and fourth convolutional layers are built similarly. $f^3$ and $f^4$ refer to the third and fourth feature vectors extracted for a heartbeat after each layer. We denote by $L_3$ and $L_4$ the numbers of filters of the third and fourth layers, by $u_3$ and $u_4$ the block sizes for the construction of $B_{hist}$ after these layers, and by $R_3$ and $R_4$ the corresponding overlapping parameters.

#### 3.2.4. Final Output and PCA

For a given heartbeat, the final output of the network is obtained by concatenating the four feature vectors, as per $f_{\text{neur}} = [f^1, f^2, f^3, f^4]$. Given the significant size of this vector, a PCA is carried out to reduce its dimensionality; the number of components is chosen such that the explained variance is over 99.99%, thus obtaining $f_{\text{pca}}$. The final feature vector $F$ is obtained by concatenating this vector with the vector of time-domain features of the heartbeat: $F = [f_{\text{pca}}, x_{\text{time}}]$ is a vector of size 1382 for INCART heartbeats and 3020 for MIT-BIH heartbeats.

The classification step is performed by a linear SVM, with a regularization parameter *C* = 1.
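The final PCA-plus-linear-SVM stage might look as follows with scikit-learn (the feature arrays below are random stand-ins for the real $f_{\text{neur}}$ and $x_{\text{time}}$; passing a float to `PCA(n_components=...)` keeps enough components to explain at least that fraction of the variance):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

# Random stand-ins for the neural and time-domain features of N heartbeats.
rng = np.random.default_rng(0)
f_neur = rng.standard_normal((200, 64))   # N heartbeats x neural features
x_time = rng.standard_normal((200, 9))    # N heartbeats x time-domain features
labels = rng.integers(0, 2, size=200)     # stand-in heartbeat classes

pca = PCA(n_components=0.9999)            # keep >= 99.99% explained variance
f_pca = pca.fit_transform(f_neur)
features = np.concatenate([f_pca, x_time], axis=1)  # F = [f_pca, x_time]

clf = LinearSVC(C=1.0)                    # regularization parameter C = 1
clf.fit(features, labels)
pred = clf.predict(features)
```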
