Article

A Multi-Stage Automatic Method Based on a Combination of Fully Convolutional Networks for Cardiac Segmentation in Short-Axis MRI

by Italo Francyles Santos da Silva 1,†, Aristófanes Corrêa Silva 1,†, Anselmo Cardoso de Paiva 1,†, Marcelo Gattass 2,† and António Manuel Cunha 3,4,*,†
1 Applied Computing Group (NCA), Federal University of Maranhão, São Luís 65085-580, Brazil
2 Tecgraf Institute, Pontifical Catholic University of Rio de Janeiro, Rio de Janeiro 22453-900, Brazil
3 Engineering Department, University of Trás-os-Montes and Alto Douro, 5000-801 Vila Real, Portugal
4 ALGORITMI Research Centre, University of Minho, 4800-058 Guimarães, Portugal
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2024, 14(16), 7352; https://doi.org/10.3390/app14167352
Submission received: 4 June 2024 / Revised: 9 August 2024 / Accepted: 16 August 2024 / Published: 20 August 2024
(This article belongs to the Special Issue Application of Neural Computation in Artificial Intelligence)

Abstract
Magnetic resonance imaging (MRI) is a non-invasive technique used in cardiac diagnosis. Using it, specialists can measure the masses and volumes of the right ventricle (RV), left ventricular cavity (LVC), and myocardium (MYO). Segmenting these structures is an important step before this measurement. However, this process can be laborious and error-prone when done manually. This paper proposes a multi-stage method for cardiac segmentation in short-axis MRI based on fully convolutional networks (FCNs). This automatic method comprises three main stages: (1) the extraction of a region of interest (ROI); (2) MYO and LVC segmentation using a proposed FCN called EAIS-Net; and (3) RV segmentation using another proposed FCN called IRAX-Net. The proposed method was tested with the ACDC and M&Ms datasets. The main evaluation metrics are the end-diastolic (ED) and end-systolic (ES) Dice coefficients. For the ACDC dataset, the Dice results (ED and ES, respectively) are 0.960 and 0.904 for the LVC, 0.880 and 0.892 for the MYO, and 0.910 and 0.860 for the RV. For the M&Ms dataset, the ED and ES Dice coefficients are 0.861 and 0.805 for the LVC, 0.733 and 0.759 for the MYO, and 0.721 and 0.694 for the RV. These results confirm the feasibility of the proposed method.

1. Introduction

The World Health Organization (WHO) states that one of the leading causes of death globally is cardiovascular diseases, or CVDs [1]. They might arise as a result of a sedentary lifestyle, a poor diet, or a hereditary susceptibility, and they cause over 17.5 million deaths per year [2]. Early diagnosis significantly increases the likelihood of successful treatment for CVDs. In this context, cardiac magnetic resonance imaging (CMRI) is a widely used method for diagnosing CVD because it offers a reliable analysis of the anatomy and morphology of cardiac structures [3,4,5].
In the mentioned analysis, the segmentation of the internal and external contours of the cardiac structures is an important step that precedes the quantification of their mass and volume values [6]. However, this process is time-consuming when performed manually by the specialist. Furthermore, it can be tiring and, therefore, susceptible to errors that may lead to an inaccurate diagnosis.
Computational methods that automate the segmentation process can benefit specialists by providing more accurate analysis. Deep learning-based methods, such as fully convolutional networks (FCNs), have achieved promising results in tasks related to image and video segmentation [7,8], including medical imaging [9,10]. FCNs perform image feature extraction from the highest to the lowest level, identifying and extracting hidden patterns that are relevant to classifying pixels as either belonging to structures of interest or not. In addition, these networks have been used for two-, three-, and higher-dimensional medical imaging. However, processing such images can incur a high computational cost, which depends on the task’s complexity and the adopted solution.
The class imbalance between pixels, where most image components do not belong to the object of interest, is another inherent issue in medical imaging [11]. Thus, creating a viable approach that can solve the aforementioned issues and provide reliable results is still a challenge.
This paper presents a multi-stage method that combines FCNs for cardiac segmentation in short-axis MRI. The three main stages of the method are as follows: (1) extraction of a region of interest (ROI); (2) segmentation of the myocardium (MYO) and left ventricular cavity (LVC) using a coarse-to-fine approach, which combines two FCNs: the proposed EAIS-Net and U-Net; and (3) right ventricle (RV) segmentation using another proposed FCN called IRAX-Net. The final segmentation result is created by combining the outputs of the second and third stages.
The following points are presented as contributions: (1) a multi-stage method with specific segmentation approaches for each structure of interest; (2) a strategy based on the U-Net to minimize processing costs and class imbalance in the ROI extraction process; (3) the proposal of the EAIS-Net, an FCN that combines EfficientNet [12] with Inception [13], Squeeze-and-Excitation [14], and Attention blocks [15]; (4) the proposal of the IRAX-Net, an FCN that is based on X-Net [16] combined with Inception and Attention blocks.
The structure of this work is as follows: Section 2 presents some approaches found in the literature; Section 3 introduces the proposed method; Section 4 describes the datasets, the experimental setup, and the results; Section 5 presents a discussion; and Section 6 presents conclusions and future work.

2. Related Works

This section presents the related works that served as a reference for developing the study on cardiac segmentation in short-axis CMRI using convolutional neural networks. Some U-Net-based methods utilize locally acquired datasets [17,18], which impedes both the reproducibility and comparative analysis of results. Public datasets, such as the Sunnybrook Cardiac Dataset [19] and the LVSC [20], are used in many works employing FCNs [21,22] or architectures incorporating Attention Gates [23] to perform segmentation in a single-stage process. However, the datasets mentioned above are focused on segmenting the LV contours. On the other hand, the ACDC [24] and M&Ms [25] datasets are aimed at segmenting the LVC, MYO, and RV, which is the primary goal of this paper. Therefore, this section presents the works that use these datasets in more detail.
Starting with the ACDC dataset, a combination of 2D and 3D U-Nets was proposed by Isensee et al. [26] for single-stage segmentation. Along the same lines, Baumgartner et al. [27] carried out experiments using 2D and 3D U-Nets and FCN-8s, maintaining specific input slice dimensions to preserve information. However, the 2D and 3D models used require input slices with large width and height values, resulting in a computationally intensive process that may not always be feasible.
The study by Baccouch et al. [28] shows a comparative analysis of CNN and U-Net performance for cardiac MRI segmentation. During training, the authors explored data augmentation techniques, including morphological operations. A hyperparameter estimation process was also investigated, focusing on learning rate, number of epochs, and optimization algorithms such as ADAM and RMSProp to identify the most effective approach for optimizing model parameters.
Modified FCN architectures were also proposed to address cardiac segmentation. Zotti et al. [29] introduced GridNet, an architecture derived from U-Net that replaces the skip connections with convolutional layers to enhance the number of relevant feature maps. Khened et al. [30] presented an approach called multi-scale residual DenseNet, used to produce many feature maps while avoiding the exploding gradient problem. Calisto and Lai-Yuen [31] introduced AdaEn-Net, an adaptive ensemble comprising 2D and 3D FCNs. The proposed adaptation process involves running a multi-objective optimization algorithm to determine the network’s hyperparameters.
Multi-stage approaches were also developed. The method proposed by Ammar et al. [32] has an ROI extraction process that employs fast Fourier transform (FFT) to locate the heart region. Then, the extracted ROIs are passed to the conventional U-Net for segmentation, followed by a refinement step focused on the MYO masks. Similarly, Simantiris and Tziritas [33] proposed a two-stage method. However, the ROI extraction is based on the work of Grinias and Tziritas [34]. The second step uses a dilated convolutional neural network (DCNN) for segmentation.
Among the works related to the M&Ms dataset, many approaches based on U-Net and its variations are found. The method developed by Scannell et al. [35] presents an approach called domain-adversarial learning, which aims to enhance the generalization of neural networks by ensuring that they learn features independently of the input domain. During the training of a 2D U-Net for cardiac structure segmentation, its intermediate feature maps are passed to a CNN optimized to classify the input domains. Finally, the parameters of the U-Net are updated via adversarial back-propagation.
Adversarial learning is also explored by Li et al. [36] in a proposed semi-supervised method that combines U-Net and Conditional GAN (CGAN) as an autoencoder. In this process, the images are pre-cropped to a size of 160 × 160 and registered using a chosen reference image to align the centers of the LVC. The U-Net initially produces masks, which are fed into the CGAN framework’s generator network. The authors also propose a loss function to guide the overall learning of the combined networks.
Some works present modified CNNs or novel architectures to address the multi-center, multi-disease, and multi-view scenarios presented by the M&Ms dataset. The approach developed by Full et al. [37] utilizes a framework called nnU-Net, designed for the automated training of segmentation networks. This framework simplifies the configuration of hyperparameters and enables experimentation with various data augmentation techniques. Benefiting from this framework, Al Khalil et al. [38] also used nnU-Net but included an image synthesis augmentation step in their pipeline, which led them to achieve promising results.
Huang et al. [39] also utilized U-Net as a segmentation method in conjunction with a style transfer process using a proposed neural network called the ST network. Given an input image, this network is trained to generate a new output image that follows a new appearance pattern. The U-Net, in turn, is trained with images that adhere to this new pattern. According to the authors, this method can be applied to cine-MRI images from different scanners.
Lin et al. [40] developed a cascaded method that utilizes two proposed networks called AEM-Net. The first network is used to segment the entire cardiac region encompassing the three structures of interest (LVC, MYO, and RV). This result is then passed to the second network, which simultaneously generates masks for each structure.
Huang et al. [41] proposed a network called RegUNet, consisting of three modules: encoder, decoder, and localization. The encoder extracts the semantic features from the input images. Its output is simultaneously passed to the ROI localization module and the decoder. The localization module identifies an ROI encompassing LVC, MYO, and RV. Finally, the decoder generates segmentation using the features extracted by the previous two modules.
Singh et al. [42] proposed ARW-Net, which combines residual links and attention gates with the W-Net architecture [43]. In this method, residual connections are incorporated in the encoder part of the network, while attention gates are used in the decoder part. ARW-Net uses a combination of Dice and Cross-entropy losses in a specific proportion (0.8:0.2) selected based on empirical evaluation and experimentation.
Similarly to some of the mentioned works, the method proposed in the present work also follows a multi-stage approach: it starts with ROI extraction to crop a region encompassing the LVC, MYO, and RV, and then passes this region to a segmentation network, which yields the final masks of those structures. In contrast to those works, the proposed method presents two further stages after the ROI extraction. Thus, the segmentation method is divided into smaller tasks, each based on the specific characteristics of the exams and cardiac structures.
In a previous work [44], we developed a cascade method using FCNs. In that work, the extracted ROIs were sent to three different FCNs to perform the initial segmentation of each cardiac structure separately. These results were then sent to a third stage called refinement, in which a reconstruction module based on two U-Nets, combined with image processing techniques, improves the generated masks and provides the final result. However, despite the cascade design, this method is more complex, as it uses six FCNs: one for ROI extraction, three in the initial segmentation stage, and two more for reconstruction. In addition, the refinement step comprises image processing techniques parameterized according to the characteristics of the images in the ACDC dataset.
On the other hand, the proposed method presented in this work was evaluated in a broader context using both ACDC and M&Ms datasets. In this method, the ROIs extracted by the first stage are fed into two different stages, each developed based on a specific segmentation context. In the MYO and LVC segmentation stage, the method uses the proposed EAIS-Net, which is employed to segment those two structures simultaneously. This architecture combines EfficientNet with Attention, Inception, and squeeze-and-excitation blocks. Additionally, a scope reduction process is employed in this step to address the class imbalance and refine the yielded results. In the RV segmentation, the proposed IRAX-Net is used. This architecture is based on X-Net and incorporates Inception and Attention blocks.

3. Proposed Method

This section presents the proposed multi-stage method for cardiac segmentation, as illustrated in Figure 1. The method comprises three stages. In the first stage, an ROI extraction process limits the analysis to a specific region of interest. The extracted ROI is then utilized for the MYO and LVC segmentation in the second stage. Lastly, the third stage focuses on the RV segmentation. The subsequent sections provide more details about the proposed method.

3.1. First Stage: Extracting an ROI

ROI extraction aims to reduce background areas, remove information irrelevant to the segmentation process, and mitigate the imbalance of pixel classes by identifying and extracting the region containing the MYO, LVC, and RV. Consequently, this stage minimizes computational costs by avoiding processing the original-sized slices.
The input images are first preprocessed. Due to our hardware limitations, they are resized to 160 × 160. Then, the pixel intensities are normalized to the range [0, 1]. After preprocessing, the images are passed to a U-Net with VGG16 as the backbone [9,45], whose blocks are composed of 2D convolutional layers with ReLU activation and batch normalization, techniques commonly used to prevent overfitting.
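A minimal sketch of this preprocessing follows, assuming scikit-image for resizing (the function name and interpolation settings are ours):

```python
import numpy as np
from skimage.transform import resize

def preprocess_slice(slice_2d, target_size=(160, 160)):
    """Resize a short-axis slice and min-max normalize intensities to [0, 1]."""
    resized = resize(slice_2d.astype(np.float32), target_size,
                     preserve_range=True, anti_aliasing=True)
    lo, hi = resized.min(), resized.max()
    return (resized - lo) / (hi - lo + 1e-8)  # guard against constant slices
```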
To simplify the learning process, and because the MYO is centered in the medial slices, as can be seen in Figure 2, the U-Net was trained to generate masks specifically for the myocardium rather than for all three structures. Since only one region is the focus of the segmentation, the complexity of this task is reduced. Therefore, in this stage, U-Net generates MYO masks only for the medial slices of a given volume. These masks are called reference segmentations because they are the starting point for extracting the bounding boxes that delimit the ROI. In tests, this approach resulted in more accurate segmentation results. In this context, the last U-Net layer consists of a 1 × 1 convolution with the sigmoid activation function to perform pixel-level classification between 0 (background) and 1 (MYO).
Before the bounding-box generation, the predictions are eroded using a 3 × 3 square structuring element to eliminate tiny disconnected artifacts. Subsequently, the method identifies the input slice with the largest area in its reference segmentation. A bounding box is then generated from the center of this slice, and its dimensions are adjusted based on the original slice size in order to include the structures of interest.
Tests verified that the generated bounding boxes have dimensions between 120 × 120 and 160 × 160. These bounding boxes are replicated throughout the volume and used as a reference for extracting the ROI. Finally, these ROIs are fed into the second and third stages of the proposed method.
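A minimal sketch of this bounding-box generation follows. The box-sizing heuristic here is an assumption, since the paper only reports the resulting size range (120 × 120 to 160 × 160); the function name is ours:

```python
import numpy as np
from scipy import ndimage

def extract_roi_box(pred_masks, min_size=120, max_size=160):
    """Generate one bounding box per volume from the medial MYO predictions.

    pred_masks is assumed to be an (n_slices, H, W) binary array of
    reference segmentations produced by the U-Net.
    """
    # Erode each mask with a 3x3 square element to remove tiny artifacts.
    eroded = np.stack([ndimage.binary_erosion(m, structure=np.ones((3, 3)))
                       for m in pred_masks])
    # Pick the slice whose reference segmentation has the largest area.
    areas = eroded.sum(axis=(1, 2))
    best = eroded[int(np.argmax(areas))]
    # Center the box on that slice's mask centroid.
    cy, cx = ndimage.center_of_mass(best)
    h, w = best.shape
    side = int(np.clip(max(h, w) * 0.75, min_size, max_size))  # assumed rule
    y0 = int(np.clip(cy - side // 2, 0, h - side))
    x0 = int(np.clip(cx - side // 2, 0, w - side))
    # The same box is replicated across all slices of the volume.
    return y0, x0, side
```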

3.2. Second Stage: MYO and LVC Segmentation

This stage receives the ROIs extracted from the previous stage as input. It generates the masks for MYO and LVC using an approach divided into three substages: coarse segmentation, extraction of a new ROI, and refined segmentation. Each step will be explained in detail below.

3.2.1. Coarse Segmentation

The coarse segmentation substage starts by preprocessing the ROIs passed as input. Zero padding is applied to set the ROI size to 160 × 160, chosen because it is the largest dimension among the ROIs extracted in the previous stage. The padded ROIs are then submitted to EAIS-Net, a network proposed in this work, depicted in Figure 3.
EAIS-Net uses EfficientNet B3 [12] for feature extraction in the encoder path. The main idea is to benefit from the main mechanisms used by this network, namely the MBConv blocks, which are composed of two structures: inverted bottlenecks (IBs) and squeeze-and-excitation (SAE) blocks. IB blocks have fewer trainable parameters than conventional convolutions because they employ depth-wise convolutions [46]. Additionally, the SAE blocks aim to calibrate feature maps and prioritize the most significant ones during the learning process by giving them larger weights [14]. The B3 model was chosen based on tests and on the promising outcomes in the ImageNet and ImageNet-V2 challenges reported for the EfficientNet family [47].
EAIS-Net uses proposed structures called AIS blocks in the decoder path, each followed by 2 × 2 upsampling, always halving the number of generated filters (from 512 down to 32). The acronym AIS refers to the combination of Attention, Inception, and SAE blocks, as seen in Figure 4.
Consider g as the output of a preceding block and x as the skip connection originating from the encoder path. The Attention block receives both x and g as inputs. The upsampling operator, with a scale of 2, takes g as input. The two outputs are concatenated and subsequently fed into the Inception block, followed by the SAE block, which produces the final output of the AIS block.
The incorporation of Attention blocks in this work draws inspiration from the study conducted by Oktay et al. [15]. These blocks assign weights to specific regions within feature maps transmitted through skip connections. As a result, the most relevant regions of the map receive more weight than the others. Furthermore, the use of Inception blocks [13] aims to learn features at different scales as the size of the structures of interest changes throughout the CMRI volume.
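To make this dataflow concrete, the following is a minimal Keras sketch of an AIS block under simplifying assumptions: the attention gate is a reduced additive variant in the spirit of Oktay et al. [15], the Inception block uses three parallel branches, and all function names and internal filter counts are ours rather than the paper’s:

```python
import tensorflow as tf
from tensorflow.keras import layers

def attention_gate(x, g, inter_ch):
    """Simplified additive attention gate: weight the skip features x by g."""
    theta_x = layers.Conv2D(inter_ch, 1)(x)
    phi_g = layers.UpSampling2D(2)(layers.Conv2D(inter_ch, 1)(g))  # match x
    att = layers.Activation('relu')(layers.add([theta_x, phi_g]))
    att = layers.Conv2D(1, 1, activation='sigmoid')(att)
    return x * att  # broadcast the attention map over the channels of x

def inception_block(t, ch):
    """Parallel 1x1, 3x3, and 5x5 convolutions, concatenated [13]."""
    b1 = layers.Conv2D(ch, 1, padding='same', activation='relu')(t)
    b3 = layers.Conv2D(ch, 3, padding='same', activation='relu')(t)
    b5 = layers.Conv2D(ch, 5, padding='same', activation='relu')(t)
    return layers.concatenate([b1, b3, b5])

def sae_block(t, ratio=8):
    """Squeeze-and-excitation channel recalibration [14]."""
    ch = t.shape[-1]
    s = layers.GlobalAveragePooling2D()(t)
    s = layers.Dense(ch // ratio, activation='relu')(s)
    s = layers.Dense(ch, activation='sigmoid')(s)
    return t * layers.Reshape((1, 1, ch))(s)

def ais_block(x, g, ch):
    """AIS block: attention on the skip, upsample g, concat, Inception, SAE."""
    att = attention_gate(x, g, ch)
    up = layers.UpSampling2D(2)(g)
    return sae_block(inception_block(layers.concatenate([att, up]), ch))
```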
Lastly, the output block uses pointwise convolutions with the Softmax activation function. This process culminates in the initial segmentation of the MYO and LVC. However, this result may contain some errors. From this observation, a hypothesis emerged that the results could be improved by including a second segmentation substage applied to a new ROI. This process is described in the following section.

3.2.2. Mask Aggregation Process: The Extraction of a New ROI

In this substage, the masks generated by the initial segmentation are used to delimit a new ROI. The aim is to eliminate more background regions and further reduce the imbalance between pixel classes.
The mask aggregation process follows a few steps, as shown in Figure 5. For each volume, the MYO and LVC masks generated by EAIS-Net are combined, resulting in a new mask called supermask. This supermask comprises all regions of the volume where the structures of interest are found.
However, the supermask inherits the inaccuracies produced by the coarse segmentation substage, such as wrongly predicted background regions (false positives). A morphological opening with a circular structuring element of size 5 × 5 is applied to reduce them. This operation may leave some regions disconnected from the supermask, so a region filter is applied to keep only the object with the largest area closest to the center of the image, which is always the supermask. Next, a morphological dilation is applied to the supermask using the same parameters as the opening. Finally, the resulting supermask is used to extract a new ROI from all the volume slices. This new ROI is then submitted to the next substage: the refined segmentation.
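A sketch of this mask aggregation follows, assuming (n_slices, H, W) binary masks from EAIS-Net and approximating the 5 × 5 circular structuring element with scikit-image’s disk of radius 2 (function name ours):

```python
import numpy as np
from skimage import measure, morphology

def clean_supermask(myo_vol, lvc_vol):
    """Build and clean the supermask as described above."""
    # Union of MYO and LVC over the whole volume gives the supermask.
    supermask = np.logical_or(myo_vol, lvc_vol).max(axis=0)
    selem = morphology.disk(2)
    # Opening removes false-positive regions inherited from the coarse stage.
    opened = morphology.opening(supermask, selem)
    labels = measure.label(opened)
    if labels.max() == 0:
        return supermask
    # Region filter: keep the largest object (per the paper, this object
    # is also the one closest to the image center).
    regions = measure.regionprops(labels)
    kept = labels == max(regions, key=lambda r: r.area).label
    # Dilation with the same element restores the object's extent; the
    # result delimits the new ROI cropped from every slice of the volume.
    return morphology.dilation(kept, selem)
```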

3.2.3. Refined Segmentation

The refined segmentation substage uses the U-Net to segment MYO and LVC based on the newly extracted ROIs obtained through mask aggregation. This network was chosen based on the experiments performed. The U-Net used has the same architecture described in Section 3.1. Therefore, this network receives the new ROIs generated in the preceding substage and obtains the LVC and MYO masks. These masks will be combined with those generated by the RV segmentation step, whose details will be presented in the following section.

3.3. Third Stage: RV Segmentation

The RV segmentation is addressed in a separate stage since the experiments verified that this strategy offered a better result. The RV is a structure with variable size and shape throughout the exam volume. Therefore, it is the most complex among the structures of interest studied in the scope of this work.
To segment the RV, the proposed method employs an additional FCN, called IRAX-Net, whose proposed architecture is based on the X-Net [16]. As input, this network receives the ROIs extracted by the ROI extraction step (Section 3.1), which are first preprocessed as described in Section 3.2.1.
Figure 6 shows the proposed architecture of IRAX-Net. This network has two phases of feature extraction. In the first one, Inception-ResNet-v2 (IRv2) [48] is used as a backbone. This network comprises Inception blocks that incorporate residual connections instead of the traditional concatenation of filters in their output. This strategy allows Inception residual blocks to perform feature extraction with multiscale filters and a lower computational cost.
In the second phase’s encoder path, only convolutional blocks (3 × 3 kernels) and max pooling operators (2 × 2 kernels) are used. Upsampling operators (2 × 2 kernels) are used in both decoder paths. This network also shares feature maps between the layers using skip connections. In the second phase, Attention blocks are used to assign weights to the regions of these maps that are most relevant to the learning process (similarly to EAIS-Net). The last layer consists of 1 × 1 convolutions with a sigmoid activation function for pixel-wise classification between the values 0 (background) and 1 (right ventricle). Finally, as the last step of the proposed method, the masks generated for the MYO, LVC, and RV are combined to produce the final result.
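As a minimal illustration of this final combination step, the three binary masks can be merged into a single label map. The label convention below (1 = RV, 2 = MYO, 3 = LVC, following the usual ACDC encoding) and the overwrite order used to resolve overlaps are assumptions:

```python
import numpy as np

def combine_masks(lvc, myo, rv):
    """Merge per-structure binary masks into one label map.

    Later assignments overwrite earlier ones where masks overlap, which
    is one simple way to resolve conflicts between stages.
    """
    out = np.zeros(lvc.shape, dtype=np.uint8)
    out[rv.astype(bool)] = 1
    out[myo.astype(bool)] = 2
    out[lvc.astype(bool)] = 3
    return out
```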

4. Experiments and Results

This section provides additional information regarding the datasets. Additionally, the proposed method’s results are reported, case studies are presented, and comparisons with related works are shown.

4.1. Datasets

The experiments were performed using two datasets: the Automated Cardiac Disease Diagnosis Challenge (ACDC) and the Multi-Centre, Multi-Vendor, and Multi-Disease Cardiac Image Segmentation Challenge (M&Ms).
The ACDC dataset [24] is composed of 150 exams divided into five groups evenly distributed: (1) healthy patients; (2) patients with previous myocardial infarction; (3) dilated cardiomyopathy; (4) hypertrophic cardiomyopathy; and (5) abnormal right ventricle.
The 150 exams in the ACDC dataset are divided into two sets: training and testing. The first contains 100 exams captured during the end-systole (ES) and end-diastole (ED) phases. This set also includes the ground truth, which delimits the MYO, LVC, and RV structures in all basal, medial, and apical slices. The training set was used in local tests during the implementation of the proposed method. Finally, the testing set consists of 50 exams without ground truth. Therefore, the generated masks had to be submitted to the challenge’s online platform to evaluate the proposed method using this set. All exams contain images of the cardiac cycle in the short-axis view.
The M&Ms dataset [25] consists of short-axis cine MRI exams of 345 patients who were healthy or affected by the following pathologies: hypertrophic, dilated, and ischemic cardiomyopathies; coronary disease; right ventricular anomalies and myocarditis. All were scanned in clinical centers in three different countries using MRI scanners from four distinct vendors, providing greater image acquisition diversity.
As in the case of the ACDC dataset, M&Ms contains exams extracted in the ES and ED cardiac phases. Of the total available, only 320 contain the annotations of the LVC, MYO, and RV structures. The dataset with annotations is divided into three sets: 150 patients for training, 34 for validation, and 136 for testing. The slices have width and height dimensions varying between 196 × 240 and 512 × 512 .

4.2. Experiment Setup

The proposed method was developed using hardware with an Intel Core i5-7300HQ CPU, 8 GB RAM, and an NVIDIA GeForce GTX 1050 GPU. The software specifications used in this work are Python 3.6 with the machine learning frameworks Keras [49] and TensorFlow, and Windows 10 operating system. The following metrics are used to evaluate the segmentation results: the Dice coefficient, intersection over union (IoU), sensitivity (SENS), and precision (PREC).
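For clarity, all four metrics can be computed from the binary confusion counts; a minimal sketch (function name ours):

```python
import numpy as np

def seg_metrics(pred, gt, eps=1e-8):
    """Dice, IoU, sensitivity, and precision for a pair of binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    return {
        'dice': 2 * tp / (2 * tp + fp + fn + eps),
        'iou':  tp / (tp + fp + fn + eps),
        'sens': tp / (tp + fn + eps),   # recall over the ground truth
        'prec': tp / (tp + fp + eps),
    }
```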

4.3. Results with the ACDC Dataset

This section presents the results obtained from the experiments with the ACDC dataset. As mentioned earlier, a set of 100 exams from the ACDC dataset was used to develop the proposed method. Three subsets were created at random: training (60 exams), validation (20 exams), and testing (20 exams). These subsets were kept the same during the experiments. The results of each substage of the method are shown in the following subsections.

4.3.1. ROI Extraction Stage

The experiments in this stage examined two U-Nets, A and B. Only medial slices of the MYO were used to train U-Net A. On the other hand, U-Net B was trained to segment this structure over all slices of the exam. As mentioned in Section 3.1, MYO segmentation is the basis for delimiting bounding boxes in the ROI extraction step, so a comparative study was conducted to determine which approach is superior for creating the reference segmentations. Since U-Net A represents a specialized training approach, only medial slices were selected to compose the network’s training set. However, this set of medial slices is small, so the following data augmentation operations were used to increase its size: horizontal flip, rotation in the range [−5°, 5°], and vertical and horizontal translation. The size increased from 120 to 1080 slices.
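A sketch of this augmentation policy using Keras’ ImageDataGenerator follows; the translation magnitudes are not reported, so the shift ranges are assumed:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=5,          # rotations within [-5, 5] degrees
    horizontal_flip=True,
    width_shift_range=0.1,     # assumed horizontal translation range
    height_shift_range=0.1,    # assumed vertical translation range
)
# For image-mask pairs, two generators driven by the same seed keep the
# geometric transforms aligned between slices and their annotations.
```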
As U-Net B was trained with all the slices (including basal and apical), data augmentation was unnecessary; the training set in this case contains 1154 slices. U-Nets A and B each have about 18 million parameters to be tuned. They were trained using the following hyperparameters: 300 epochs, Jaccard loss function, learning rate of 0.001, Adam optimizer [50], decay factor of 0.1, and the early stopping technique [51].
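A sketch of this training configuration in Keras follows; the soft Jaccard loss is one standard formulation, and the patience values and the plateau-based interpretation of the decay factor are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import backend as K

def jaccard_loss(y_true, y_pred, smooth=1.0):
    """Soft Jaccard (IoU) loss: 1 - |A ∩ B| / |A ∪ B|."""
    inter = K.sum(y_true * y_pred)
    union = K.sum(y_true) + K.sum(y_pred) - inter
    return 1.0 - (inter + smooth) / (union + smooth)

callbacks = [
    # Early stopping [51]: halt when the validation loss stops improving.
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=20,
                                     restore_best_weights=True),
    # Decay factor of 0.1 interpreted as a plateau-based LR reduction.
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1,
                                         patience=10),
]

# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
#               loss=jaccard_loss)
# model.fit(train_gen, validation_data=val_gen, epochs=300,
#           callbacks=callbacks)
```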
As stated in Section 3.1, in this stage, the models were trained to segment only the MYO in the medial slices due to its better visibility, thus serving as a reference for ROI extraction. So, the test set used for the evaluation contains only medial slices, totaling 100 cases. Table 1 shows the outcomes of the MYO segmentation as a component of the ROI extraction stage.
This evaluation indicates that the specialized training represented by U-Net A is a better approach than U-Net B. The first achieved the best results for the test set with Dice of 0.9368 and IoU of 0.8820. In contrast, the latter produced less satisfactory results, thus showing the impact of having been trained with a greater variety of slices.
An additional analysis was conducted using the mean absolute error (MAE) [52] and mean average precision (mAP) [53] metrics. These metrics quantify the similarity between bounding boxes manually extracted from the ground truth and bounding boxes automatically generated from the reference segmentations produced by U-Nets A and B. The results are presented in Table 2. The IoU threshold used in the mAP calculation is IoU ≥ 0.80 in this evaluation. This value was selected to ensure higher confidence in the detections. Thus, the generated bounding boxes are classified as true positives only if the IoU calculated with the ground truth is equal to or greater than the defined threshold.
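For reference, a minimal sketch of this bounding-box comparison (helper names ours; boxes given as (y0, x0, y1, x1)):

```python
import numpy as np

def box_iou(a, b):
    """IoU between two boxes given as (y0, x0, y1, x1)."""
    y0, x0 = max(a[0], b[0]), max(a[1], b[1])
    y1, x1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, y1 - y0) * max(0, x1 - x0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

def box_mae(a, b):
    """Mean absolute error between corresponding box coordinates."""
    return np.mean(np.abs(np.asarray(a, float) - np.asarray(b, float)))

# A generated box counts as a true positive when box_iou(pred, gt) >= 0.80.
```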
The ROI extraction approach that used the masks generated by U-Net A as the segmentation reference obtained mAP of 0.90 and MAE of 2.93, which indicates that the extracted ROIs are very similar to the ground truth. MAE analysis indicates consistent results due to the low standard deviation. Furthermore, it is worth noting that in some cases, the MAE value is zero, indicating that the ROIs automatically extracted by U-Net A are identical to those obtained from the ground truth. Despite the input slices being resized to 160 × 160 , which could lead to some features being changed and relevant information lost, the results show that this resizing did not cause adverse effects on the generation of bounding boxes.
In this evaluation, U-Net B also obtained the least expressive results, strengthening the choice of U-Net A to compose the ROI extraction stage in the proposed method. The high similarity with the ground truth indicates that all structures of interest (MYO, LVC, and RV) are kept within the ROIs, which will be used as input for the further stages of the proposed method.

4.3.2. MYO and LVC Segmentation Stage

Regarding the coarse segmentation substage, the same data augmentation operations used in the ROI extraction stage were applied here, generating 10,386 slices for the EAIS-Net training set, which is composed of ROIs extracted using the ground truth as a reference.
Besides EAIS-Net, experiments with the conventional U-Net and a variation of this network containing the AIS blocks (Section 3.2.1) were carried out. These networks were trained with the following hyperparameters: 300 epochs, Jaccard Loss function, Adam optimizer, and a learning rate of 0.001 with a decay of 0.1.
The Jaccard loss is a commonly used metric in semantic segmentation models, known for its scale invariance and its superior ability to reflect the perceptual quality of a model compared to pixel-wise accuracy [54,55]. The choice of the Adam optimizer is based on findings in the literature in which this algorithm has shown better performance than stochastic optimization methods [50]. The learning rate, decay factor, and number of epochs were chosen empirically. However, as the network has a high number of trainable parameters (about 53 million), a high number of training epochs is necessary for the model to converge [56].
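For reference, the soft Jaccard loss minimized during training can be written as follows, where $y_i$ is the ground-truth label of pixel $i$ and $\hat{y}_i$ is the predicted probability:

```latex
\mathcal{L}_{\text{Jaccard}} = 1 - J(Y, \hat{Y})
  = 1 - \frac{\sum_i y_i \,\hat{y}_i}
             {\sum_i y_i + \sum_i \hat{y}_i - \sum_i y_i \,\hat{y}_i}
```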
In this case, the conventional U-Net is a basis for comparison as it is widely used in medical image segmentation methods. Its variation with AIS Blocks incorporated was tested to analyze the impact of using these structures. Table 3 shows the obtained results. The test set contains 386 slices. It comprises the ROIs extracted in the previous stage, emulating the proposed method’s workflow.
U-Net with AIS blocks in the decoder path produced better MYO segmentations than the conventional U-Net. For the LVC segmentation, there was a slight decrease in the Dice coefficient of 0.0027 and an increase in the IoU value of 0.0032. Based on these results, it is possible to validate the use of the proposed AIS blocks as a structure that can provide improvements.
The EAIS-Net, in turn, presents Dice and IoU values noticeably better than those obtained by the other networks. This network uses EfficientNet B3 as the backbone and AIS blocks in the decoder path. For the LVC segmentation, EAIS-Net produced a Dice of 0.9154 and an IoU of 0.8742, an average improvement of 2.9% over the conventional U-Net. For the MYO segmentation, these values are, respectively, 0.8454 and 0.7578, indicating an average improvement of 5% over the U-Net result. Compared with other EfficientNet models, the B5 and B7 models achieved lower results than the B3 model, which is part of the EAIS-Net. Therefore, based on this experiment, EAIS-Net was chosen to compose this stage of the proposed method.
However, as mentioned in Section 3.2, there was a hypothesis that the results could be improved by including a second segmentation stage, which would be applied to the smaller ROIs. Therefore, to validate this hypothesis, tests were conducted with the same networks used in the previous experiment, which receive as input the new ROIs extracted by the mask aggregation process reported in Section 3.2.2.
The networks were trained using a set of ROIs extracted by the mask aggregation process, using ground truth as reference segmentation. On the other hand, the test set is composed of ROIs extracted following the proposed method’s workflow, i.e., the mask aggregation process uses the coarse segmentation generated via EAIS-Net as reference segmentation since this network presented the best performance in the previous experiment. The same hyperparameters (epochs, optimizer, and learning rate) of those networks tested in the coarse segmentation substage were used for the training process. The outcomes can be seen in Table 4.
As a result of the mask aggregation process, the new ROIs contain fewer background pixels. Thus, the structures of interest are more prominent, and the class imbalance is further reduced. As the EAIS-Net is deeper and has more complex convolutional blocks than the conventional U-Net, its performance is better for segmentation in a context with more information. The U-Net with AIS blocks, in turn, is a less complex network, and in this reduced-scope context its results surpassed those of EAIS-Net.
Metrics show that U-Net outperforms the other tested networks. It is a network with a simpler architecture, and because of that, it presented better performance when applied to the new ROIs extracted since they have a reduced scope. Finally, the hypothesis that inspired the development of the refined segmentation substage was validated. Therefore, the results justify the choice of the EAIS-Net and U-Net networks to compose the coarse and refined substages of the MYO and LVC segmentation process.

4.3.3. RV Segmentation Stage

The training, validation, and test sets used in this stage consist of the same ROIs from the previous experiment. However, the ground truth for all sets includes only RV annotations.
The proposed method uses the IRAX-Net to perform the RV segmentation. This network was trained for 300 epochs using the Jaccard loss function, Adam optimizer, learning rate of 0.001, and decay factor of 0.1. It has approximately 73.3 million trainable parameters.
For comparison, other networks besides IRAX-Net were tested. The results achieved can be seen in Table 5.
The experiments with U-Net and EAIS-Net were initially performed to segment the three structures. However, those networks presented promising results only for the MYO and LVC segmentation. In the case of the RV, less expressive results were obtained, which motivated the execution of tests with other networks.
Tests were carried out with X-Net, but this network performed poorly. Adaptations were then made to this architecture, such as using the EfficientNet B3 and Inception-ResNet-v2 (IRv2) networks as backbones; the latter presented better results. Given that, the last proposed adaptation included the Attention blocks in the second decoder path of the modified X-Net (with the IRv2 backbone), thus creating the IRAX-Net. In the experiments performed, this network achieved the best results for RV segmentation, obtaining a Dice of 0.8213 and an IoU of 0.7623, which, on average, represent an increase of 5.8% over the results of EAIS-Net.
It is noted that IRv2’s feature extraction process is more effective than both the standard architecture of X-Net and its version with EfficientNet B3 as a backbone. In addition, Attention blocks proved important in obtaining more expressive results. Therefore, IRAX-Net was the network chosen to integrate the RV segmentation stage.

4.3.4. Final Result and Case Studies

The final result is defined by joining the MYO, LVC, and RV masks generated by the proposed method. Table 6 presents the overall Dice and IoU results obtained by the proposed method. This table shows the evaluation of the method applied to the same test set, whose slices are divided into two groups referring to the cardiac phases of end-diastole (ED) and end-systole (ES). This analysis is similar to that performed in the ACDC challenge.
The results show that the proposed method produces better segmentations for ED slices. During this cardiac phase, the heart is more dilated, making the structures of interest more visible. In the case of ES slices, the results show a decrease, especially those related to the RV. In this cardiac phase, the structures are more compressed, visually changing their sizes and shapes, contributing to segmentation inaccuracies.
Figure 7 shows the qualitative results of the proposed method for the basal, medial, and apical slices. In these examples, it is possible to verify the changes in the shape and size of the structures of interest throughout the volume. Despite this, the method generated good segmentation results, emphasizing Dice values above 95% for the basal slice case. In the medial example, it is possible to observe the segmentation of the three structures, all presenting expressive results. In this example, the method generated a segmentation for the MYO slightly larger than the ground truth due to the texture similarity with very close regions. The same can be seen in the apical example in the case of the RV. However, although this structure is very small in this type of slice, the method produced a good result.
Figure 8 shows some examples in which the method is applied in slices during the ES phase, which, as previously mentioned, is the one that presents segmentation inaccuracies more frequently. The basal example does not have structures of interest, but the method generated false positives due to the similarity between background regions with MYO and LVC. In the medial example, the three structures are visible. The MYO and LVC segmentations showed promising results. However, the RV has a more contracted and stretched shape, so the method only identified part of this structure. Finally, in the apical example, LVC and MYO are very small but maintain their shape features. On the other hand, the RV has a different shape and a texture similar to the background regions, which leads the method to produce false negatives.
Lastly, regarding the computational complexity, the average execution time of the proposed method is 3.3 s per exam. Among the stages of the proposed method, the MYO and LVC segmentation has the highest average execution time, which is 1.69 s per exam. This stage is divided into several substages, leading to a longer processing time. Therefore, it is possible to infer that the proposed method achieves promising results in a time considered satisfactory.

4.3.5. Segmentation Impact Analysis

Some experiments were carried out to evaluate the impact of changes made in the proposed method on the segmentation of cardiac structures. The first experiment evaluates the individual performance of the networks used in the proposed method (U-Net, EAIS-Net, and IRAX-Net). In this case, these networks represent a single-stage segmentation process, and their results are compared with those obtained by the proposed multi-stage method. This experiment intends to demonstrate the rationale for choosing a multi-stage approach over a single-stage one. A comparative analysis can be seen in Table 7.
In U-Net, EAIS-Net, and IRAX-Net tests, the input slice dimensions were resized to 160 × 160 due to hardware limitations. In general, the results achieved by these approaches are inferior to the proposed method. One of the observed causes is that resizing the slices reduces the size of the structures of interest. This reduction complicates the feature extraction process, especially in the apical regions of the exam, where the structures of interest are naturally small. On the other hand, the proposed method divides the segmentation process into stages, allowing better handling of class imbalance and keeping the size of cardiac structures unchanged.
The other experiment aimed to analyze the impact of the ROI extraction stage on the final segmentation result. In this experiment, the automatically extracted ROIs were replaced with a test set containing the same exams but composed of ROIs extracted from the ground truth, thus simulating a semi-supervised process. The results are presented in Table 8.
Through this experiment, it is possible to notice that the automatic process causes some losses for the LVC and MYO. On the other hand, the results related to the RV obtained a slight advantage compared to the semi-supervised process. Thus, a z-test [57] was performed to evaluate whether the difference between the results is statistically significant. The parameters used were the sample size (382 slices) and a significance level (α) of 0.05 (as commonly recommended). The null hypothesis is that there is no significant difference between the results. Table 9 shows the results of the hypothesis test. The p-values found are all greater than α; therefore, the conclusion is that there is no significant difference between the results.
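A sketch of such a comparison on per-slice Dice values follows; the exact z-test variant used is not specified in the paper, so a two-sample test on means is assumed:

```python
import numpy as np
from scipy import stats

def two_sample_z_test(x, y):
    """Two-sample z-test on per-slice metric values (e.g., Dice)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    z = (x.mean() - y.mean()) / se
    p = 2 * stats.norm.sf(abs(z))   # two-sided p-value
    return z, p

# With the null hypothesis of no difference, p > 0.05 supports the
# conclusion that the automatic and semi-supervised ROI pipelines are
# statistically indistinguishable on this test set.
```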
Based on this analysis, it is observed that the ROI extraction stage presents a low impact, confirming the feasibility of the automatic segmentation proposal in the application scenario, in which a computational solution is important to minimize the efforts of specialists.

4.3.6. Comparison with Related Works

The ACDC challenge offers a test set comprising exams without ground truth, used only for the online evaluation process, which employs the following metrics: Dice coefficient and Hausdorff distance (HD) for geometric analysis; correlation (C) and bias (B) to measure ED and ES volumes; and ejection fraction (EF) and myocardial mass for other clinical purposes. Table 10, Table 11 and Table 12 show a comparison between the proposed method and the works evaluated in the ACDC challenge.
The proposed method achieves results with values close to those of the top-ranked approaches in the challenge. In the LVC segmentation, this method surpasses the results of Dice ED and ES obtained by more robust approaches, such as AdaEn-Net [31] and the multi-atlas segmentation proposed by Rohé et al. [61], the latter being also surpassed in the RV segmentation (Dice ES).
The approaches with better results than the proposed method use techniques that demand a higher computational cost and, therefore, more robust hardware, including high-performance graphics cards. The method developed by Isensee et al. [26] is outstanding, achieving first place in two out of three scenarios. It is based on an ensemble of 2D and 3D U-Nets. Moreover, their method took less than one second for the 3D model and 1–2 s for the 2D model on a Pascal Titan X GPU. This graphics card has approximately 5.6 times more CUDA cores than the GTX 1050 GPU used in developing our method (3584 versus 640), making it significantly more efficient at processing neural network ensembles and larger images. As mentioned in Section 4.3.4, the average execution time of our proposed method is 3.3 s per exam. In addition, it obtains results close to those of Isensee et al. [26], with a difference of 0.007 in Dice ED for the LVC segmentation.
Regarding MYO segmentation, the proposed method presents a better Dice ED than AdaEn-Net and the approach developed by Wolterink et al. [60]. Both approaches use a single CNN to segment the three structures simultaneously, with input slices at their original size. The proposed method, in turn, is divided into stages, the first being responsible for extracting an ROI, thus eliminating more background regions and helping to obtain better results. Although this method does not have a specific post-processing stage, it obtains Dice ED and ES values that are very close to those achieved by our previous work [44] and by Painchaud et al. [59], which are approaches that use FCNs to refine the RV segmentation. The former, surpassed in Dice ED, uses a U-Net-based reconstruction module, and the latter uses an adversarial variational autoencoder (AVAE). However, these strategies add more FCNs to the methods’ pipelines, increasing their complexity.
Lastly, regarding RV segmentation, the proposed method produces a lower result than those obtained for the LVC and MYO. However, it still achieves an interesting placement, being competitive with other approaches.

4.4. Results with the M&Ms Dataset

This section presents the results obtained by the proposed method in the tests with the M&Ms dataset. It is important to highlight that these experiments were performed once the method had been consolidated. Therefore, this section focuses solely on the final results obtained by applying the selected techniques in each stage of the proposed method.
Two experiments were performed. In the first, called T_ACDC, the proposed method is used in a transfer learning approach, in which the network models are trained with the ACDC dataset and tested with the M&Ms dataset. In the second, called T_MMS, the M&Ms dataset is used both for training the models and for testing. It is important to note that, in the latter experiment, the networks were trained with the same hyperparameters presented in Section 4.3. The same data augmentation techniques were applied to the training set, which grew from 3286 to 24,616 slices. The validation and test sets have 806 and 3218 slices, respectively.
In all cases, the input slices were preprocessed in the same way as in the experiments with the ACDC dataset. Table 13 shows the comparative results obtained.
The evaluation metrics achieved more expressive values in the T_MMS experiment than in T_ACDC. Thus, it is possible to verify that training the models with images from the M&Ms dataset yielded better results than using models previously trained with the ACDC dataset. The M&Ms dataset has greater diversity among the exams due to the different scanner models used in the acquisition. Therefore, training the models with the M&Ms dataset can introduce features to the networks that are not present in the ACDC dataset.
With the LVC and MYO structures, all metrics showed a noticeable increase. Taking the Dice coefficient as an example, in the T_ACDC experiment, the proposed method obtained 0.7761 and 0.6757 for the LVC and MYO, respectively. In the T_MMS experiment, these values increased to 0.8339 and 0.7464. For the RV, better performance of the method is also observed in the T_MMS experiment; however, the metric values do not show a significant increase. For example, Dice and IoU went from 0.6945 and 0.5902, respectively, to 0.7083 and 0.6067, an increase of less than 0.02. The most significant increase can be observed in the PREC metric, which rose from 0.7432 to 0.7675. These results indicate a reduction in the production of false positives in the RV segmentation in the T_MMS experiment.
Table 14 shows the results divided between the ED and ES cardiac phases. In the T_ACDC experiment, the best results were obtained for the slices in the ED phase, in which the structures are more visible than in the ES phase. In the T_MMS experiment, only the LVC and RV have better results in the ED phase. For the MYO, the method performs better on ES slices. This result is influenced by the larger number of examples in the M&Ms dataset, allowing the networks to learn more features about this type of slice.
Figure 9 shows the segmentation results produced by the proposed method in each experiment. In the apical example, the Dice values improved from 0.8726 (LVC), 0.7498 (MYO), and 0.8023 (RV) in the T_ACDC experiment to 0.9478 (LVC), 0.8579 (MYO), and 0.8472 (RV) in the T_MMS experiment.
The medial case illustrates one of the main flaws observed in the T_ACDC experiment: the partial segmentation of structures, indicating the generation of false negatives. These failures occur less frequently in the T_MMS experiment, reiterating the importance of training the networks with examples from the M&Ms dataset. In the example in Figure 9, the Dice values went from 0.7830 (LVC), 0.9352 (MYO), and 0.8351 (RV) to 0.9237 (LVC), 0.8047 (MYO), and 0.9048 (RV).
In the basal case, an additional drawback observed in the T_ACDC experiment is the generation of false positives. The basal slices consist of LVC and MYO regions, with no RV present. However, some slices may not contain any of the structures of interest but have regions similar in shape and texture, which leads to segmentation inaccuracies. This effect is reduced in the T_MMS experiment because the M&Ms dataset contains more slices with such characteristics than the ACDC dataset.
Finally, a z-test was performed to evaluate the statistical difference between the results of the two experiments. In this test, the null hypothesis is that there is no significant difference. The significance level (α) used is 0.05, so if the p-value is less than α, the null hypothesis is rejected, indicating a significant difference. Table 15 presents the p-values obtained from the tests.
The results indicate a significant difference in all metrics used to evaluate the segmentation of the MYO and LVC, corroborating the observation that the proposed method also shows promising performance when applied to the M&Ms dataset, especially in the experiment where the network models are trained with this dataset. In the case of the RV, the difference is significant only in the precision results, reinforcing the finding that, in the T_MMS experiment, the method was able to reduce false positives.

Comparison with Related Works

In Table 16, the results obtained by the proposed method are compared with the related works in the context of the M&Ms dataset. The metric used for analysis is the Dice coefficient, as indicated by the challenge. It is emphasized that most of the methods presented report Dice separately by cardiac phase (ED and ES), but some use overall Dice. Therefore, to facilitate comparison, the overall Dice coefficient and the Dice coefficient per cardiac phase are presented in this table.
Scannell et al. [35] used 2D U-Net with adversarial back-propagation training to handle inter-exam variability due to acquisition by different scanners. The input slices used in the mentioned work were previously cropped in the dimensions 192 × 192 , with the cardiac structures properly centered, characterizing a semi-supervised scenario. In contrast, the proposed method simulates a fully automated scenario, with ROI extraction involving the structures of interest, also considering the possible displacements within the ROIs.
Huang et al. [39] used two networks: the so-called ST network, aimed at preprocessing the slices, which are resized to 256 × 256, and a 2D U-Net, which receives the output of the ST network to produce the final segmentation. The proposed method, on the other hand, applies as preprocessing only techniques aimed at reducing the noise of cine-MRI, and its multi-stage segmentation avoids resizing the input slices so as not to change their features, especially in the ES phase, when the cardiac structures are more compressed. Thus, the proposed method obtained superior results for the MYO and LVC in ES slices.
The works that outperform the proposed method are characterized by using techniques that demand a higher computational cost. Full et al. [37], for example, used nnU-Net, a framework that allows automated network training through hyperparameter estimation and a search for the best set of data augmentation operations. Lin et al. [40] and Huang et al. [41] proposed methods that receive the 3D volume of the exam as input, requiring more robust hardware.
The proposed method, in turn, was developed through experiments with the ACDC dataset, which has a similar context to M&Ms. However, the latter contains more heterogeneity due to the different types of scanners used in image acquisition. Nevertheless, it is observed that the proposed method obtained relevant results for the evaluation metrics, surpassing some works that were developed considering, since its conception, the aspects of the M&Ms dataset.

5. Discussion

Some strengths of the proposed method can be mentioned. The proposed FCNs (EAIS-Net and IRAX-Net) outperform the U-Net results in the experiments performed with the local test set. In this case, using AIS blocks, combined with other backbones for feature extraction, was pivotal to obtain improvements in contrast to conventional convolutional blocks. In addition, experiments show that the multi-stage approach presented by the proposed method obtains better results than those based on a single-stage process, thus becoming viable in an automated segmentation context.
Compared with the related works in ACDC and M&Ms challenges, the proposed method achieves competitive results for segmenting cardiac structures. Most of the related work uses variations of U-Net for single-stage segmentation. They have been developed on more robust hardware than the multi-stage approach presented in this work.
Regarding the experiments with the ACDC dataset, all tests performed show that the proposed method produces better segmentations for the ED slices. In the evaluation on the online platform, this method obtains results that are close to the well-ranked methods in the LVC and MYO segmentation cases. In the experiments with the M&Ms dataset, which presents a context of more significant heterogeneity given the different scanners used to acquire the exams, it is observed that the proposed method presents better results for slices in the ES phase only in the case of MYO. For the other structures, the performance is better with the ED slices.
The proposed method uses four FCNs in total, fewer than the previously developed method [44], which uses six FCNs. The results obtained with the ACDC dataset show that, even with the reduction in the number of networks, the Dice values produced are similar; notably, in the RV experiments, the Dice ED of the previous method was surpassed by the method proposed in this work. This was achieved using only the IRAX-Net, whereas the previous method combined the results of FCNs and image processing techniques in its second and third steps to generate the final RV mask.
The decrease in the number of FCNs combined with other techniques used by the proposed method also reduced the execution time. In experiments with the ACDC dataset for comparison, this method presented an average execution time of 3 s per scan. The MYO and LVC segmentation stage is the slowest, with an average time of 1.6 s per scan because of its substages. In [44], the method completes its entire execution flow in an average of 4 s per scan in the same experiment. In that case, the initial segmentation stage is the most time-consuming, with an average of 2 s per exam. This slowness is due to the execution of three networks, each used to segment a specific cardiac structure.
It should also be noted that the proposed networks (EAIS-Net and IRAX-Net) have architectures that combine established backbones with attention mechanisms, and their results were promising compared with architectures built from conventional convolution blocks.
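As a generic illustration of how channel attention can gate backbone features, the sketch below implements a squeeze-and-excitation block in the style of Hu et al. [14]. It is an analogy only; the actual composition of the AIS blocks is the one depicted in Figure 4.

```python
from tensorflow.keras import layers

def se_attention(x, ratio=8):
    """Squeeze-and-excitation channel attention (Hu et al. [14]),
    shown as an analogy for attention-gated backbone features."""
    ch = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)            # squeeze: global context
    s = layers.Dense(ch // ratio, activation="relu")(s)
    s = layers.Dense(ch, activation="sigmoid")(s)     # excitation: channel gates
    s = layers.Reshape((1, 1, ch))(s)
    return layers.Multiply()([x, s])                  # reweight feature maps
```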
However, the results also indicate some limitations. In most cases, the performance of the proposed method on ES slices is inferior to that on ED slices. In particular, the results for the RV are lower than those obtained for the LVC and MYO. The variability of shapes presented by the RV throughout the volume is one of the leading causes: false positives tend to arise in basal slices, where the RV is often occluded by structures with a similar texture, while false negatives tend to appear in apical slices, where the RV occupies only a tiny region of the image.

6. Conclusions and Future Works

This work addresses cardiac segmentation in short-axis MRI through a multi-stage method. In the first stage, irrelevant information is removed from the input slices by extracting ROIs. The second stage performs the MYO and LVC segmentation, and the third generates the RV masks.
This segmentation process is important for the early detection and treatment of CVDs. The proposed method achieved promising results: the outcomes of EAIS-Net and IRAX-Net surpass U-Net and the other networks tested, and quantitatively comparable results were obtained in the validation on the online platform of the ACDC challenge and with the test set of the M&Ms dataset. It is important to note that our approach was developed with more limited hardware resources than most works competing in the mentioned challenges.
Future work should explore other preprocessing techniques to deal with the heterogeneous aspects of cardiac MR image acquisition.
Approaches based on attention mechanisms [63], which can be incorporated into the composition of the AIS blocks, will also be tested. Additionally, experiments with vision transformers [64] and Vision Mamba [65] will be conducted in each segmentation stage of the proposed method to enhance the overall results, focusing on improving RV segmentation performance.

Author Contributions

Conceptualization, I.F.S.d.S., A.M.C., A.C.S., A.C.d.P. and M.G.; methodology, I.F.S.d.S.; software, I.F.S.d.S.; writing—original draft, I.F.S.d.S.; investigation, I.F.S.d.S.; supervision, A.C.S., A.M.C. and A.C.d.P.; validation, A.C.S., A.M.C. and A.C.d.P.; funding acquisition, M.G. and A.M.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Brazilian fomenting agencies Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)—Finance Code 001, Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Fundação de Amparo à Pesquisa e ao Desenvolvimento Científico e Tecnológico do Maranhão (FAPEMA), Empresa Brasileira de Serviços Hospitalares (Ebserh) Grant number 409593/2021-4, and Fundação para a Ciência e Tecnologia, IP (FCT) within the R&D Units Project Scope: UIDB/00319/2020 (ALGORITMI).

Data Availability Statement

The data that support the findings of this study are openly available at https://www.creatis.insa-lyon.fr/Challenge/acdc/databases.html accessed on 19 August 2024 (ACDC Dataset) and https://www.ub.edu/mnms/ accessed on 19 August 2024 (M&Ms Dataset). Scripts are available at https://github.com/francyles/sec-method accessed on 19 August 2024.

Acknowledgments

The authors acknowledge the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Brazil—Finance Code 001, the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Brazil, the Fundação de Amparo à Pesquisa e ao Desenvolvimento Científico e Tecnológico do Maranhão (FAPEMA), Brazil, the Empresa Brasileira de Serviços Hospitalares (Ebserh), Grant number 409593/2021-4, and the Fundação para a Ciência e Tecnologia, IP (FCT), within the R&D Units Project Scope: UIDB/00319/2020 (ALGORITMI), for the financial support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. WHO. Cardiovascular Diseases. Available online: https://www.who.int/health-topics/cardiovascular-diseases (accessed on 24 May 2024).
2. Hazra, A.; Mandal, S.K.; Gupta, A.; Mukherjee, A.; Mukherjee, A. Heart disease diagnosis and prediction using machine learning and data mining techniques: A review. Adv. Comput. Sci. Technol. 2017, 10, 2137–2159.
3. Myerson, S.G.; Bellenger, N.G.; Pennell, D.J. Assessment of left ventricular mass by cardiovascular magnetic resonance. Hypertension 2002, 39, 750–755.
4. Faridah Abdul Aziz, Y.; Fadzli, F.; Rizal Azman, R.; Mohamed Sani, F.; Vijayananthan, A.; Nazri, M. State of the heart: CMR in coronary artery disease. Curr. Med. Imaging Rev. 2013, 9, 201–213.
5. Sara, L.; Szarf, G.; Tachibana, A.; Shiozaki, A.A.; Villa, A.V.; Oliveira, A.C.d.; Albuquerque, A.S.d.; Rochitte, C.E.; Nomura, C.H.; Azevedo, C.F.; et al. II Diretriz de ressonância magnética e tomografia computadorizada cardiovascular da Sociedade Brasileira de Cardiologia e do Colégio Brasileiro de Radiologia. Arq. Bras. Cardiol. 2014, 103, 1–86.
6. Li, B.; Liu, Y.; Occleshaw, C.J.; Cowan, B.R.; Young, A.A. In-line automated tracking for ventricular function with magnetic resonance imaging. JACC Cardiovasc. Imaging 2010, 3, 860–866.
7. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
8. Caelles, S.; Maninis, K.K.; Pont-Tuset, J.; Leal-Taixé, L.; Cremers, D.; Van Gool, L. One-shot video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 221–230.
9. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
10. Wang, J.; Liu, X. Medical image recognition and segmentation of pathological slices of gastric cancer based on Deeplab v3+ neural network. Comput. Methods Programs Biomed. 2021, 207, 106210.
11. Li, Z.; Kamnitsas, K.; Glocker, B. Analyzing Overfitting Under Class Imbalance in Neural Networks for Image Segmentation. IEEE Trans. Med. Imaging 2020, 40, 1065–1077.
12. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
13. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
14. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
15. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.C.H.; Heinrich, M.P.; Misawa, K.; Mori, K.; McDonagh, S.G.; Hammerla, N.Y.; Kainz, B.; et al. Attention U-net: Learning where to look for the pancreas. In Proceedings of the 1st Conference on Medical Imaging with Deep Learning (MIDL 2018), Amsterdam, The Netherlands, 4–6 July 2018; pp. 1–10.
16. Bullock, J.; Cuesta-Lázaro, C.; Quera-Bofarull, A. XNet: A convolutional neural network (CNN) implementation for medical X-ray image segmentation suitable for small datasets. In Medical Imaging 2019: Biomedical Applications in Molecular, Structural, and Functional Imaging; Gimi, B., Krol, A., Eds.; International Society for Optics and Photonics; SPIE: Bellingham, WA, USA, 2019; Volume 10953, pp. 453–463.
17. Abdeltawab, H.; Khalifa, F.; Taher, F.; Alghamdi, N.S.; Ghazal, M.; Beache, G.; Mohamed, T.; Keynton, R.; El-Baz, A. A deep learning-based approach for automatic segmentation and quantification of the left ventricle from cardiac cine MR images. Comput. Med. Imaging Graph. 2020, 81, 101717.
18. Shi, J.; Ye, Y.; Zhu, D.; Su, L.; Huang, Y.; Huang, J. Automatic segmentation of cardiac magnetic resonance images based on multi-input fusion network. Comput. Methods Programs Biomed. 2021, 209, 106323.
19. Radau, P.; Lu, Y.; Connelly, K.; Paul, G.; Dick, A.; Wright, G. Evaluation framework for algorithms segmenting short axis cardiac MRI. MIDAS J.-Card. MR Left Ventricle Segmentation Chall. 2009, 49, 1–7.
20. Suinesiaputra, A.; Cowan, B.R.; Al-Agamy, A.O.; Elattar, M.A.; Ayache, N.; Fahmy, A.S.; Khalifa, A.M.; Medrano-Gracia, P.; Jolly, M.P.; Kadish, A.H.; et al. A collaborative resource to build consensus for automated left ventricular segmentation of cardiac MR images. Med. Image Anal. 2014, 18, 50–62.
21. Tran, P.V. A fully convolutional neural network for cardiac segmentation in short-axis MRI. arXiv 2016, arXiv:1604.00494v3.
22. Hu, H.; Pan, N.; Wang, J.; Yin, T.; Ye, R. Automatic segmentation of left ventricle from cardiac MRI via deep learning and region constrained dynamic programming. Neurocomputing 2019, 347, 139–148.
23. Cui, H.; Yuwen, C.; Jiang, L.; Xia, Y.; Zhang, Y. Multiscale attention guided U-Net architecture for cardiac segmentation in short-axis MRI images. Comput. Methods Programs Biomed. 2021, 206, 106142.
24. Bernard, O.; Lalande, A.; Zotti, C.; Cervenansky, F.; Yang, X.; Heng, P.A.; Cetin, I.; Lekadir, K.; Camara, O.; Ballester, M.A.G.; et al. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved? IEEE Trans. Med. Imaging 2018, 37, 2514–2525.
25. Campello, V.M.; Gkontra, P.; Izquierdo, C.; Martín-Isla, C.; Sojoudi, A.; Full, P.M.; Maier-Hein, K.; Zhang, Y.; He, Z.; Ma, J.; et al. Multi-Centre, Multi-Vendor and Multi-Disease Cardiac Segmentation: The M&Ms Challenge. IEEE Trans. Med. Imaging 2021, 40, 3543–3554.
26. Isensee, F.; Jaeger, P.F.; Full, P.M.; Wolf, I.; Engelhardt, S.; Maier-Hein, K.H. Automatic cardiac disease assessment on cine-MRI via time-series segmentation and domain specific features. In Proceedings of the International Workshop on Statistical Atlases and Computational Models of the Heart, Quebec City, QC, Canada, 10–14 September 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 120–129.
27. Baumgartner, C.F.; Koch, L.M.; Pollefeys, M.; Konukoglu, E. An exploration of 2D and 3D deep learning techniques for cardiac MR image segmentation. In Proceedings of the International Workshop on Statistical Atlases and Computational Models of the Heart, Quebec City, QC, Canada, 10–14 September 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 111–119.
28. Baccouch, W.; Oueslati, S.; Solaiman, B.; Labidi, S. A comparative study of CNN and U-Net performance for automatic segmentation of medical images: Application to cardiac MRI. Procedia Comput. Sci. 2023, 219, 1089–1096.
29. Zotti, C.; Luo, Z.; Humbert, O.; Lalande, A.; Jodoin, P.M. GridNet with automatic shape prior registration for automatic MRI cardiac segmentation. In Proceedings of the International Workshop on Statistical Atlases and Computational Models of the Heart, Quebec City, QC, Canada, 10–14 September 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 73–81.
30. Khened, M.; Kollerathu, V.A.; Krishnamurthi, G. Fully convolutional multi-scale residual DenseNets for cardiac segmentation and automated cardiac diagnosis using ensemble of classifiers. Med. Image Anal. 2019, 51, 21–45.
31. Calisto, M.B.; Lai-Yuen, S.K. AdaEn-Net: An ensemble of adaptive 2D–3D Fully Convolutional Networks for medical image segmentation. Neural Netw. 2020, 126, 76–94.
32. Ammar, A.; Bouattane, O.; Youssfi, M. Automatic cardiac cine MRI segmentation and heart disease classification. Comput. Med. Imaging Graph. 2021, 88, 101864.
33. Simantiris, G.; Tziritas, G. Cardiac MRI Segmentation with a Dilated CNN Incorporating Domain-Specific Constraints. IEEE J. Sel. Top. Signal Process. 2020, 14, 1235–1243.
34. Grinias, E.; Tziritas, G. Fast fully-automatic cardiac segmentation in MRI using MRF model optimization, substructures tracking and B-spline smoothing. In Proceedings of the International Workshop on Statistical Atlases and Computational Models of the Heart, Quebec City, QC, Canada, 10–14 September 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 91–100.
35. Scannell, C.M.; Chiribiri, A.; Veta, M. Domain-adversarial learning for multi-centre, multi-vendor, and multi-disease cardiac MR image segmentation. In Proceedings of the International Workshop on Statistical Atlases and Computational Models of the Heart, Lima, Peru, 4 October 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 228–237.
36. Li, S.; Zhang, Y.; Yang, X. Semi-supervised Cardiac MRI Segmentation Based on Generative Adversarial Network and Variational Auto-Encoder. In Proceedings of the 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Houston, TX, USA, 9–12 December 2021; pp. 1402–1405.
37. Full, P.M.; Isensee, F.; Jäger, P.F.; Maier-Hein, K. Studying robustness of semantic segmentation under domain shift in cardiac MRI. In Proceedings of the International Workshop on Statistical Atlases and Computational Models of the Heart, Lima, Peru, 4 October 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 238–249.
38. Al Khalil, Y.; Amirrajab, S.; Lorenz, C.; Weese, J.; Pluim, J.; Breeuwer, M. Reducing segmentation failures in cardiac MRI via late feature fusion and GAN-based augmentation. Comput. Biol. Med. 2023, 161, 106973.
39. Huang, X.; Chen, Z.; Yang, X.; Liu, Z.; Zou, Y.; Luo, M.; Xue, W.; Ni, D. Style-invariant cardiac image segmentation with test-time augmentation. In Proceedings of the International Workshop on Statistical Atlases and Computational Models of the Heart, Lima, Peru, 4 October 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 305–315.
40. Lin, M.; Jiang, M.; Zhao, M.; Ukwatta, E.; White, J.A.; Chiu, B. Cascaded triplanar autoencoder M-Net for fully automatic segmentation of left ventricle myocardial scar from three-dimensional late gadolinium-enhanced MR images. IEEE J. Biomed. Health Inform. 2022, 26, 2582–2593.
41. Huang, X.; Chen, W.; Liu, X.; Wu, H.; Wen, Z.; Shen, L. Left and Right Ventricular Segmentation Based on 3D Region-Aware U-Net. In Proceedings of the 2022 IEEE 35th International Symposium on Computer-Based Medical Systems (CBMS), Shenzhen, China, 21–22 July 2022; pp. 137–142.
42. Singh, K.R.; Sharma, A.; Singh, G.K. Attention-guided residual W-Net for supervised cardiac magnetic resonance imaging segmentation. Biomed. Signal Process. Control 2023, 86, 105177.
43. Singh, K.R.; Sharma, A.; Singh, G.K. W-Net: Novel deep supervision for deep learning-based cardiac magnetic resonance imaging segmentation. IETE J. Res. 2023, 69, 8960–8976.
44. da Silva, I.F.S.; Silva, A.C.; de Paiva, A.C.; Gattass, M. A cascade approach for automatic segmentation of cardiac structures in short-axis cine-MR images using deep neural networks. Expert Syst. Appl. 2022, 197, 116704.
45. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556v6.
46. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
47. Recht, B.; Roelofs, R.; Schmidt, L.; Shankar, V. Do ImageNet classifiers generalize to ImageNet? In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 5389–5400.
48. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 4278–4284.
49. Chollet, F. Keras. 2015. Available online: https://keras.io (accessed on 24 May 2024).
50. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference for Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015; pp. 1–15.
51. Prechelt, L. Early stopping-but when? In Neural Networks: Tricks of the Trade; Springer: Berlin/Heidelberg, Germany, 1998; pp. 55–69.
52. Sammut, C.; Webb, G.I. Encyclopedia of Machine Learning; Springer: New York, NY, USA, 2010; ISBN 978-0-387-30164-8.
53. Liu, L.; Özsu, M.T. Encyclopedia of Database Systems; Springer: New York, NY, USA, 2009; ISBN 978-0-387-39940-9.
54. Eelbode, T.; Bertels, J.; Berman, M.; Vandermeulen, D.; Maes, F.; Bisschops, R.; Blaschko, M.B. Optimization for medical image segmentation: Theory and practice when evaluating with Dice score or Jaccard index. IEEE Trans. Med. Imaging 2020, 39, 3679–3690.
55. Maier-Hein, L.; Reinke, A.; Godau, P.; Tizabi, M.D.; Buettner, F.; Christodoulou, E.; Glocker, B.; Isensee, F.; Kleesiek, J.; Kozubek, M.; et al. Metrics reloaded: Recommendations for image analysis validation. Nat. Methods 2024, 21, 195–212.
56. Götz, T.I.; Göb, S.; Sawant, S.; Erick, X.; Wittenberg, T.; Schmidkonz, C.; Tomé, A.; Lang, E.; Ramming, A. Number of necessary training examples for neural networks with different number of trainable parameters. J. Pathol. Inform. 2022, 13, 100114.
57. Romano, J.P.; Lehmann, E. Testing Statistical Hypotheses, 3rd ed.; Springer: Berlin, Germany, 2005; ISBN 0-387-98864-5.
58. Zotti, C.; Luo, Z.; Lalande, A.; Jodoin, P.M. Convolutional neural network with shape prior applied to cardiac MRI segmentation. IEEE J. Biomed. Health Inform. 2018, 23, 1119–1128.
59. Painchaud, N.; Skandarani, Y.; Judge, T.; Bernard, O.; Lalande, A.; Jodoin, P.M. Cardiac MRI segmentation with strong anatomical guarantees. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Shenzhen, China, 13–17 October 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 632–640.
60. Wolterink, J.M.; Leiner, T.; Viergever, M.A.; Išgum, I. Automatic segmentation and disease classification using cardiac cine MR images. In Proceedings of the International Workshop on Statistical Atlases and Computational Models of the Heart, Quebec City, QC, Canada, 10–14 September 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 101–110.
61. Rohé, M.M.; Sermesant, M.; Pennec, X. Automatic multi-atlas segmentation of myocardium with SVF-Net. In Proceedings of the International Workshop on Statistical Atlases and Computational Models of the Heart, Quebec City, QC, Canada, 10–14 September 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 170–177.
62. Patravali, J.; Jain, S.; Chilamkurthy, S. 2D-3D fully convolutional neural networks for cardiac MR segmentation. In Proceedings of the International Workshop on Statistical Atlases and Computational Models of the Heart, Quebec City, QC, Canada, 10–14 September 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 130–139.
63. Sinha, A.; Dolz, J. Multi-scale self-guided attention for medical image segmentation. IEEE J. Biomed. Health Inform. 2020, 25, 121–130.
64. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
65. Yue, Y.; Li, Z. MedMamba: Vision Mamba for medical image classification. arXiv 2024, arXiv:2403.03849.
Figure 1. The proposed multi-stage method: (1) ROI extraction stage, (2) MYO and LVC segmentation stage, and (3) RV segmentation stage.
Figure 2. Visibility of the structures of interest in the (A) basal, (B) medial, and (C) apical slices.
Figure 3. EAIS-Net: the proposed network for the coarse segmentation of MYO and LVC.
Figure 4. Architecture of the AIS Blocks used by the EAIS-Net.
Figure 5. Mask aggregation process for the extraction of a new ROI.
Figure 6. IRAX-Net: the network proposed for RV segmentation.
Figure 7. Qualitative results of the proposed method. Examples of basal, medial, and apical slices.
Figure 8. Qualitative segmentation results produced by the proposed method in ES slices.
Figure 9. Qualitative results of the proposed method in the M&Ms dataset experiments, with (A) the input, (B) the ground truth, (C) the result of the T_ACDC experiment, and (D) the result of the T_MMS experiment.
Table 1. ROI extraction stage: comparative results between U-Net A and B.

Method     Dice      IoU
U-Net A    0.9368    0.8820
U-Net B    0.9041    0.8261
Table 2. MAE and mAP results for evaluating the generated bounding boxes.

Method     mAP     MAE     St. Dev.    Max.     Min.
U-Net A    0.90    2.93    4.27        18.25    0.00
U-Net B    0.60    6.93    6.18        24.00    0.75
Table 3. Results obtained in experiments for the coarse segmentation substage.

Structure    Method                   Dice      IoU
LVC          U-Net                    0.8903    0.8400
             U-Net + AIS Blocks       0.8876    0.8432
             FCN (EfficientNetB5)     0.8995    0.8558
             FCN (EfficientNetB7)     0.8998    0.8564
             EAIS-Net                 0.9154    0.8742
MYO          U-Net                    0.8016    0.7002
             U-Net + AIS Blocks       0.8141    0.7183
             FCN (EfficientNetB5)     0.8262    0.7299
             FCN (EfficientNetB7)     0.8332    0.7435
             EAIS-Net                 0.8454    0.7578
Table 4. Comparison between the results obtained in the experiments for the refined MYO and LVC segmentation substage.

Structure    Method                 Dice      IoU
LVC          U-Net                  0.9236    0.8851
             U-Net + AIS Blocks     0.9203    0.8782
             EAIS-Net               0.9047    0.8640
MYO          U-Net                  0.8547    0.7708
             U-Net + AIS Blocks     0.8393    0.7491
             EAIS-Net               0.8356    0.7498
Table 5. Comparative analysis between the networks used in the RV segmentation process.

Method                                Dice      IoU
U-Net                                 0.7025    0.6249
EAIS-Net                              0.7649    0.7013
X-Net                                 0.6864    0.6150
X-Net (EfficientNet B3)               0.5835    0.4826
X-Net (IRv2)                          0.7208    0.6595
IRAX-Net (IRv2 + Attention Blocks)    0.8213    0.7623
Table 6. Approaches selected to compose the proposed method and the final result of the segmentation process.

Stage                                          Structure    Dice (Overall)    Dice ED    Dice ES    IoU (Overall)    IoU ED    IoU ES
MYO and LVC segmentation (EAIS-Net + U-Net)    LVC          0.9236            0.9473     0.9000     0.8851           0.9145    0.8557
                                               MYO          0.8547            0.8616     0.8479     0.7708           0.7703    0.7713
RV segmentation (IRAX-Net)                     RV           0.8213            0.8546     0.7880     0.7623           0.8003    0.7244
Table 7. Comparison between the results obtained by the proposed method and other single-stage segmentation approaches.

Structure    Method             Dice      IoU
LVC          U-Net              0.8592    0.7837
             EAIS-Net           0.8840    0.7784
             IRAX-Net           0.8000    0.7147
             Proposed Method    0.9236    0.8851
MYO          U-Net              0.4457    0.3213
             EAIS-Net           0.4399    0.3177
             IRAX-Net           0.6169    0.4715
             Proposed Method    0.8547    0.7708
RV           U-Net              0.5975    0.4995
             EAIS-Net           0.6309    0.5353
             IRAX-Net           0.5307    0.4187
             Proposed Method    0.8213    0.7623
Table 8. Experiments performed with the proposed method following two ROI extraction approaches: semi-supervised and fully automated.

Structure    ROI Extraction     Dice      IoU
LVC          Semi-supervised    0.9360    0.8975
             Fully automated    0.9236    0.8851
MYO          Semi-supervised    0.8655    0.7825
             Fully automated    0.8547    0.7708
RV           Semi-supervised    0.8105    0.7524
             Fully automated    0.8213    0.7623
Table 9. The p-values found for Dice and IoU metrics resulting from semi-supervised and fully automated ROI extraction approaches.

Structure    Dice      IoU
LVC          0.5024    0.5819
MYO          0.6670    0.6978
RV           0.7001    0.7422
Table 10. LVC segmentation. The evaluation was performed on the ACDC online platform (HD: Hausdorff distance; EF: ejection fraction; Vol.: volume; C.: correlation; B.: bias).

Method                          Dice ED    Dice ES    HD ED     HD ES     EF C.    EF B.     Vol. ED C.    Vol. ED B.
Proposed Method                 0.960      0.904      14.976    17.488    0.983    0.310     0.995         −0.710
Simantiris and Tziritas [33]    0.967      0.928      6.366     7.573     0.993    −0.360    0.998         2.032
da Silva et al. [44]            0.963      0.912      8.062     10.432    0.975    1.030     0.994         0.110
Isensee et al. [26]             0.967      0.928      5.476     6.921     0.991    0.490     0.997         1.530
Zotti et al. [58]               0.964      0.912      6.180     8.386     0.990    −0.476    0.997         3.746
Painchaud et al. [59]           0.961      0.911      6.152     8.278     0.990    −0.480    0.997         3.824
Khened et al. [30]              0.964      0.917      8.129     8.968     0.989    −0.548    0.997         0.576
Baumgartner et al. [27]         0.963      0.911      6.526     9.170     0.988    0.568     0.995         1.436
Calisto and Lai-Yuen [31]       0.958      0.903      5.592     8.644     0.981    0.494     0.997         3.072
Wolterink et al. [60]           0.961      0.918      7.515     9.603     0.988    −0.494    0.993         3.046
Rohé et al. [61]                0.957      0.900      7.483     10.747    0.989    −0.094    0.993         4.182
Ammar et al. [32]               0.968      0.911      7.993     10.528    0.982    −0.390    0.997         0.650
Singh et al. [42]               0.967      0.938      5.652     6.878     -        -         -             -
Table 11. MYO segmentation. The evaluation was performed on the ACDC online platform (HD: Hausdorff distance; Vol.: volume; Mass: myocardial mass; C.: correlation; B.: bias).

Method                          Dice ED    Dice ES    HD ED     HD ES     Vol. ES C.    Vol. ES B.    Mass ED C.    Mass ED B.
Proposed Method                 0.880      0.892      13.440    12.590    0.972         3.210         0.971         −1.870
Isensee et al. [26]             0.904      0.923      7.014     7.328     0.988         −1.984        0.987         −2.547
Simantiris and Tziritas [33]    0.891      0.904      8.264     9.575     0.983         −2.134        0.992         −2.904
da Silva et al. [44]            0.894      0.905      7.906     9.912     0.980         −1.100        0.988         −1.820
Calisto and Lai-Yuen [31]       0.873      0.895      8.197     8.318     0.988         −1.79         0.989         −2.100
Zotti et al. [58]               0.886      0.902      9.586     9.291     0.980         1.160         0.986         −1.827
Painchaud et al. [59]           0.881      0.897      8.651     9.598     0.979         0.296         0.987         −2.906
Khened et al. [30]              0.889      0.898      9.841     12.582    0.979         −2.572        0.990         −2.873
Patravali et al. [62]           0.882      0.897      9.757     11.256    0.986         −4.464        0.989         −11.586
Baumgartner et al. [27]         0.892      0.901      8.703     10.637    0.983         −9.602        0.982         −6.861
Zotti et al. [29]               0.884      0.896      8.708     9.264     0.960         −7.804        0.984         −12.405
Wolterink et al. [60]           0.875      0.894      11.121    10.687    0.971         0.906         0.963         −0.960
Ammar et al. [32]               0.891      0.901      10.575    13.891    0.934         1.590         0.986         2.977
Singh et al. [42]               0.905      0.923      7.389     7.373     -             -             -             -
Table 12. RV segmentation. The evaluation was performed on the ACDC online platform (HD: Hausdorff distance; EF: ejection fraction; Vol.: volume; C.: correlation; B.: bias).

Method                          Dice ED    Dice ES    HD ED     HD ES     EF C.    EF B.     Vol. ED C.    Vol. ED B.
Proposed Method                 0.910      0.860      16.580    17.300    0.746    2.190     0.944         8.940
Isensee et al. [26]             0.951      0.904      8.205     11.655    0.910    −3.750    0.992         0.900
Calisto and Lai-Yuen [31]       0.936      0.884      10.183    12.234    0.899    −2.118    0.989         3.550
Simantiris and Tziritas [33]    0.936      0.889      13.289    14.367    0.894    −1.292    0.990         0.906
da Silva et al. [44]            0.900      0.860      14.660    17.560    0.743    1.810     0.931         7.370
Zotti et al. [58]               0.934      0.885      11.052    12.650    0.869    −0.872    0.986         2.372
Zotti et al. [29]               0.941      0.882      10.318    14.053    0.872    −2.228    0.991         −3.722
Painchaud et al. [59]           0.933      0.884      13.718    13.323    0.865    −0.874    0.986         2.078
Khened et al. [30]              0.935      0.879      13.994    13.930    0.858    −2.246    0.982         −2.896
Baumgartner et al. [27]         0.932      0.883      12.670    14.691    0.851    1.218     0.977         −2.290
Wolterink et al. [60]           0.928      0.872      11.879    13.399    0.852    −4.610    0.980         3.596
Rohé et al. [61]                0.916      0.845      14.049    15.926    0.781    −0.662    0.983         7.340
Ammar et al. [32]               0.929      0.886      14.189    16.042    0.863    0.670     0.973         4.000
Singh et al. [42]               0.950      0.895      8.513     12.167    -        -         -             -
Table 13. Results of the experiments performed with the M&Ms dataset (SENS: sensitivity; PREC: precision).

Experiment    Structure    Dice      IoU       SENS      PREC
T_ACDC        LVC          0.7761    0.6851    0.7445    0.7139
              MYO          0.6757    0.5529    0.6529    0.6105
              RV           0.6945    0.5902    0.7712    0.7432
T_MMS         LVC          0.8339    0.7553    0.8620    0.8215
              MYO          0.7464    0.6263    0.7546    0.7679
              RV           0.7083    0.6067    0.7890    0.7675
Table 14. M&Ms dataset: results of the proposed method separated by cardiac phase.

Experiment    Structure    Dice ED    Dice ES    IoU ED    IoU ES
T_ACDC        LVC          0.8385     0.7136     0.7707    0.5994
              MYO          0.6806     0.6707     0.5524    0.5535
              RV           0.7252     0.6639     0.6299    0.5505
T_MMS         LVC          0.8618     0.8059     0.7987    0.7119
              MYO          0.7330     0.7598     0.6076    0.6451
              RV           0.7218     0.6949     0.6294    0.5839
Table 15. p-values found for the evaluation metrics of the proposed method in the T_ACDC and T_MMS experiments.

Structure    Dice      IoU       SENS      PREC
LVC          0.0000    0.0000    0.0000    0.0000
MYO          0.0000    0.0000    0.0000    0.0000
RV           0.2247    0.1754    0.0836    0.0229
Table 16. Overall (O) and per cardiac phase (ED and ES) Dice results of the proposed method and related works for LVC, MYO, and RV segmentation.

                         LVC                        MYO                        RV
Method                   O        ED       ES       O        ED       ES       O        ED       ES
Scannell et al. [35]     -        0.905    0.848    -        0.772    0.820    -        0.876    0.809
Huang et al. [39]        -        0.896    0.772    -        0.761    0.721    -        0.820    0.698
Full et al. [37]         -        0.939    0.886    -        0.839    0.867    -        0.910    0.860
Li et al. [36]           -        0.930    0.894    -        0.764    0.828    -        0.883    0.822
Lin et al. [40]          -        0.924    0.889    -        0.827    0.859    -        0.878    0.843
Huang et al. [41]        0.945    -        -        0.869    -        -        0.912    -        -
Al Khalil et al. [38]    0.925    -        -        0.821    -        -        0.901    -        -
Singh et al. [42]        -        0.940    0.890    -        0.839    0.870    -        0.909    0.856
Proposed Method          0.833    0.861    0.805    0.746    0.733    0.759    0.708    0.721    0.694