Article

Unsupervised Image Segmentation on 2D Echocardiogram

1 Department of Industrial, Manufacturing, and Systems Engineering (IMSE), Texas Tech University, Lubbock, TX 79409, USA
2 PennState College of Medicine, Heart and Vascular Institute, Milton S. Hershey Medical Center, 500 University Dr, Hershey, PA 17033, USA
* Author to whom correspondence should be addressed.
Algorithms 2024, 17(11), 515; https://doi.org/10.3390/a17110515
Submission received: 26 September 2024 / Revised: 1 November 2024 / Accepted: 4 November 2024 / Published: 7 November 2024

Abstract

Echocardiography is a widely used, non-invasive imaging technique for diagnosing and monitoring heart conditions. However, accurate segmentation of cardiac structures, particularly the left ventricle, remains a complex task due to the inherent variability and noise in echocardiographic images. Current supervised models have achieved state-of-the-art results but are highly dependent on large, annotated datasets, which are costly and time-consuming to obtain and whose quality directly limits model performance. These limitations motivate the need for unsupervised methods that can generalize across different image conditions without relying on annotated data. In this study, we propose an unsupervised approach for segmenting 2D echocardiographic images. By combining customized objective functions with convolutional neural networks (CNNs), our method effectively segments cardiac structures, addressing the challenges posed by low-resolution, gray-scale images. Our approach leverages techniques traditionally used outside of medical imaging, optimizing feature extraction through CNNs in a data-driven manner with a new, smaller network design. Another key contribution of this work is the introduction of a post-processing algorithm that refines the segmentation to isolate the left ventricle in both diastolic and systolic positions, enabling the calculation of the ejection fraction (EF). This calculation serves as a benchmark for evaluating the performance of our unsupervised method. Our results demonstrate the potential of unsupervised learning to improve echocardiogram analysis by overcoming the limitations of supervised approaches, particularly in settings where labeled data are scarce or unavailable.

1. Introduction

Medical imaging plays a pivotal role in the diagnosis, treatment, and monitoring of various medical conditions. Artificial intelligence (AI) is transforming the healthcare landscape, and its application in cardiology is particularly promising [1]. In echocardiography—a type of ultrasound used to examine the heart—AI is being utilized [2] to automate tasks, improve accuracy, and potentially reduce costs. Among the most commonly used methods in cardiology, 2D echocardiography stands out due to its non-invasive nature, real-time imaging capabilities, and widespread availability in clinical settings. Echocardiography uses ultrasound waves to create detailed images of the heart, enabling clinicians to assess cardiac function, structure, and pathology. However, despite its significant advantages, the interpretation of echocardiograms remains a challenging task, often requiring expert knowledge and considerable manual effort [3].
The complexity of cardiac structures, coupled with the variability and noise inherent in echocardiographic images, makes segmentation a critical yet difficult task. Image segmentation, the process of partitioning an image into meaningful regions, is a fundamental pre-processing step that facilitates subsequent analysis and diagnosis. Traditional segmentation methods, such as normalized cuts and graph-based techniques, rely on predefined features, which often fail to capture the complexity of medical images [4]. These methods are particularly ill-suited for echocardiograms, where consistent features are difficult to define due to variability in imaging conditions, patient anatomy, and pathology.
In recent years, deep learning, particularly convolutional neural networks (CNNs), has revolutionized medical image analysis, offering automated, accurate, and efficient solutions for segmentation tasks. CNNs have demonstrated remarkable success in supervised image segmentation by learning hierarchical features from large, annotated datasets. Notable supervised approaches, such as fully convolutional networks (FCNs), U-Net, and SegNet, have been widely adopted for various medical imaging applications, including cardiac imaging, brain segmentation, and retinal vessel detection [3,4]. These models excel at learning complex patterns from labeled data, enabling them to achieve state-of-the-art performance on challenging medical images.
However, supervised learning has its limitations. The primary challenge lies in the need for extensive labeled data, which is often scarce and expensive to obtain, particularly in the medical domain. In echocardiography, where expert knowledge is required to manually annotate cardiac structures, acquiring labeled datasets is both time-consuming and resource-intensive. Furthermore, annotated datasets may not capture the full diversity of imaging conditions, leading to a reduced generalization of supervised models in real-world clinical settings [5]. These limitations highlight the need for alternative approaches that can operate effectively without relying on large, annotated datasets.
In response to these challenges, there has been growing interest in unsupervised image segmentation methods, which aim to identify meaningful structures within an image based solely on the intrinsic properties of the data, without the need for labeled training data [5]. These methods offer a promising alternative, particularly in medical domains where labeled data are limited. Several unsupervised methodologies have been proposed for image segmentation, including k-means clustering, Gaussian mixture models (GMMs), and self-organizing maps (SOMs). While these approaches offer simplicity and effectiveness in some scenarios, they often struggle with the complexity and variability of medical images, particularly echocardiograms.
Recent advancements in unsupervised learning have focused on integrating CNNs with unsupervised learning objectives, combining the representational power of deep networks with clustering and feature learning techniques. For instance, Kim et al. introduced an end-to-end unsupervised segmentation method based on differentiable feature clustering, which iteratively optimizes feature representations and pixel labels to achieve high-quality segmentation results [6]. These approaches offer the potential to leverage the strengths of CNNs while avoiding the need for extensive labeled data.
In the context of echocardiography, unsupervised segmentation holds significant promise. Echocardiographic images are often noisy, subject to considerable variability, and lack consistent annotations, making them ideal candidates for unsupervised techniques. By avoiding reliance on annotated data, unsupervised methods can explore anomalies and segment heart structures more effectively, particularly at the pixel level, providing a more accurate delineation of the heart’s anatomy. By leveraging the rich feature representations learned by CNNs and the robust clustering capabilities of unsupervised methods, it is possible to develop effective segmentation algorithms that enhance echocardiogram analysis [5,7].
This paper presents a novel unsupervised methodology for segmenting 2D echocardiography images by combining objective functions with CNN-based feature extraction to achieve accurate and robust segmentation of cardiac structures. The proposed method addresses the limitations of traditional unsupervised techniques by leveraging deep learning for feature refinement and spatial continuity, offering a more accurate and flexible approach to cardiac segmentation. Evaluated on a dataset of 2D echocardiograms from Stanford [7], this approach demonstrates its efficacy and potential for improving automated cardiac analysis. By integrating deep learning with clustering techniques, this study advances the field of unsupervised medical image segmentation, providing a robust framework for segmenting complex cardiac images without the need for extensive annotated data. Additionally, it offers the flexibility to handle previously unseen data, paving the way for more efficient and accurate echocardiographic analysis.

2. Related Work

Deep learning is revolutionizing medical image analysis, particularly in echocardiography, where it automates tasks like view classification and cardiac function quantification, which were previously manual and time-consuming [5]. This section reviews the applications of deep learning in echocardiogram analysis, discusses image segmentation as a key task in medical image analysis, and focuses on its specific applications within cardiology.

2.1. Deep Learning for Image Segmentation in Medical Image Analysis

Image segmentation, which partitions an image into meaningful regions representing different anatomical structures or abnormalities, is fundamental in medical image analysis [8,9]. It plays a critical role in extracting quantitative information, visualizing anatomical structures, and guiding image-guided interventions.
Segmentation enables the quantification of anatomical features such as organ volume, wall thickness, and lesion size, providing essential data for diagnosis, treatment planning, and disease monitoring [9]. Deep learning models provide segmentation masks that allow for the creation of detailed 3D visualizations of anatomical structures, facilitating surgical planning and improving patient education. For instance, Lin et al. [10] explored segmentation in image-guided surgery (IGS) by leveraging deep learning techniques for generating 3D visualizations of patient anatomy, which were used for preoperative planning and intraoperative navigation. Their study showed significant improvements in surgical accuracy and outcomes through the use of augmented reality (AR) and virtual reality (VR) visualization technologies.
In another study, Cruz-Aceves et al. [11] applied unsupervised segmentation techniques to cardiac images, focusing on myocardial and blood pool segmentation in cardiac MR images. Their approach, which combined swarm optimization with active contours, achieved high segmentation accuracy with a Dice score of 0.89 and provided a robust framework for handling noisy datasets. The Dice score is a commonly used metric in image segmentation, which measures the overlap between the predicted segmentation and the ground truth, with a score of 1 indicating perfect overlap and 0 indicating no overlap. These methods highlight the potential of deep learning and unsupervised techniques in advancing the field of cardiology by improving image segmentation and visualization capabilities for better clinical decision-making.
Additionally, real-time segmentation can assist in guiding minimally invasive procedures, enabling surgeons to accurately target instruments while avoiding critical structures. For instance, Fozilov et al. [12] proposed a novel endoscope automation framework using real-time segmentation and tracking during minimally invasive surgery. Their system integrated hierarchical quadratic programming (HQP) control with interactive perception modules, utilizing a deep CNN to segment the surgical scene and track multi-tools in real-time. This framework significantly improved precision, with minimal visual misalignment errors during tissue resection tasks. Another study by Erkmen et al. [13] focused on needle-based interventions, employing a machine learning model based on a U-Net architecture to perform 3D ultrasound segmentation. This allowed for the accurate targeting of biopsy needles while avoiding critical anatomical structures, improving safety and precision during procedures. Both approaches underscore the potential of real-time segmentation in enhancing surgical outcomes through improved visual guidance and instrument tracking [8].

2.2. Image Segmentation in Cardiology

In cardiology, image segmentation is crucial for analyzing images from various modalities, including echocardiography, cardiac Magnetic Resonance Imaging (MRI), and cardiac Computed Tomography (CT) [9]. For instance, deep learning automates the segmentation of key structures such as the left ventricle, left atrium, and aortic valve in echocardiograms [14]. Manual segmentation in echocardiography is labor-intensive and prone to inter-observer variability, making automation particularly valuable. However, challenges such as poor image quality, speckle noise, and edge dropout complicate segmentation in echocardiography [14].
Extensive studies have utilized CNNs and U-Net architectures for segmenting key structures like the left and right ventricles, myocardium, and scar tissue from cardiac MRI images, yielding state-of-the-art performance. For example, Fozilov et al. [12] employed a deep CNN for real-time segmentation in minimally invasive surgery, showcasing the versatility of CNNs in medical imaging tasks, including MRI segmentation. Similarly, Dou et al. [15] applied CNN-based unsupervised techniques to cardiac MRI for multi-modal myocardial segmentation, achieving a Dice coefficient of 0.89. Kalra et al. [16] focused on myocardial segmentation in cardiac BOLD MRI using CNN-based unsupervised models, with their method demonstrating Dice scores as high as 0.91. These examples underscore CNNs’ effectiveness in extracting features from MRI images and accurately delineating anatomical structures, even in challenging conditions.
U-Net, a specific CNN architecture, has also played a pivotal role in cardiac MRI segmentation due to its encoder–decoder structure, which preserves spatial information while facilitating pixel-wise predictions. U-Net’s skip connections ensure the retention of high-resolution spatial features during upsampling, making it particularly suitable for segmenting complex anatomical structures like the myocardium. Lin et al. [10] demonstrated U-Net’s application in 3D image-guided surgery for cardiac MRI, while Erkmen et al. [13] employed a U-Net architecture for real-time 3D ultrasound segmentation, further illustrating its adaptability across imaging modalities. Moreover, public datasets such as ACDC have accelerated advancements in this domain by providing a benchmark for model performance, fostering further improvements in cardiac segmentation using both CNN and U-Net-based approaches [9,14].
Moreover, deep learning is also employed to segment cardiac structures and coronary arteries from cardiac CT images. Accurate segmentation is vital for quantifying stenosis, assessing plaque burden, and guiding interventions like angioplasty or stenting [14].

2.3. Deep Learning in Echocardiography

Deep learning’s ability to analyze large datasets and learn complex patterns has made it highly suitable for echocardiogram analysis [2]. Various applications have emerged in this domain, including automated view classification, quantitative cardiac function assessment, direct disease detection, and automated image quality assessment. CNNs have demonstrated remarkable accuracy in classifying standard echocardiographic views such as apical four-chamber and parasternal long-axis views [14]. Automated classification is crucial for efficient analysis, as manual classification can be time-consuming and prone to inconsistencies [3,14]. For instance, one study leveraged CNNs to classify 23 different echocardiographic viewpoints with high accuracy [5].
Additionally, deep learning models are increasingly being used to quantitatively assess cardiac function by automating the measurement of cardiac parameters, such as ejection fraction (EF), left atrial end-systolic volume (LAESV), and the E/e’ ratio, all of which are vital indicators of cardiac health [1]. One of the most recent works is EchoNet-Dynamic, which leverages temporal information from echocardiogram videos to provide beat-to-beat assessments of cardiac function, including the End-Diastolic Volume (EDV) and End-Systolic Volume (ESV), overcoming the limitations of still-image-based models [7].
Recent advances in semi-supervised learning have enabled deep learning models to leverage both labeled and unlabeled data, which is particularly valuable for echocardiography where labeled datasets are often limited. By combining labeled and unlabeled data, semi-supervised approaches enhance the model’s ability to generalize across diverse cardiac structures, even with limited annotations. Ding and Han propose a semi-supervised method that integrates both frequency-domain and spatial-domain data augmentations within the Mean Teacher framework [17]. Their approach addresses the challenge of limited annotated data by leveraging unannotated images alongside annotated ones, enhancing the model’s ability to capture global and local cardiac features. Specifically, they introduce a frequency-domain mixing strategy, which operates in the frequency domain to increase data diversity while preserving spatial structures, and an image puzzle mixing strategy, which enhances local anatomical feature perception by partitioning images and recombining them in a puzzle-like manner. Evaluated on CAMUS and EchoNet datasets, this method demonstrates superior segmentation accuracy, particularly with limited labeled data, outperforming other semi-supervised methods on metrics such as Dice and Jaccard similarity coefficients.
Furthermore, studies have explored the potential for deep learning to directly detect cardiac diseases such as hypertrophic cardiomyopathy, cardiac amyloid, and pulmonary hypertension from echocardiograms [2]. Although promising, these studies require further validation before their clinical utility can be confirmed. Moreover, deep learning models are also developed to assess echocardiographic image quality [18]. Quality assessment is crucial for reducing diagnostic errors and enhancing clinical workflow efficiency [14,19]. One study demonstrated the feasibility of CNNs in evaluating the quality of apical four-chamber echocardiograms [18].

2.4. Unsupervised Medical Image Segmentation

In the field of cardiac image segmentation, several unsupervised methods have been developed to tackle the challenges posed by the lack of labeled datasets. Aganj et al. [20] proposed an unsupervised medical image segmentation method, which was evaluated on cardiovascular magnetic resonance (MR) images. This approach demonstrated effectiveness in segmenting the ventricular myocardium and blood pool, achieving a mean Dice score of 0.60 and a cross-validated Dice score of 0.51, without relying on predefined shape models. Cruz-Aceves et al. [11] introduced a method that utilized particle swarm optimization (PSO) combined with active contours and shape priors to segment human hearts and ventricular areas from CT and MR images. This method, designed to handle the high variability in cardiac structures, yielded promising segmentation results, with a Dice score of 0.89, and demonstrated significant improvement over traditional Active Contour Model (ACM) and interactive methods in handling noise and concavities. However, both studies did not utilize echocardiograms in their analyses.
Yang et al. propose GraphEcho, a graph-driven approach for unsupervised domain adaptation (UDA) tailored to echocardiogram video segmentation [21]. GraphEcho addresses the challenges associated with transferring segmentation models across clinical sites with significant domain shifts, which can negatively impact model performance. The method introduces two key modules: Spatial-wise Cross-domain Graph Matching (SCGM) to align cardiac structures across source and target domains by capturing both local (class-specific) and global (inter-class) relationships, and Temporal Cycle Consistency (TCC), leveraging the cyclical nature of heartbeats to maintain temporal consistency across video frames. Using the newly introduced high-resolution CardiacUDA dataset collected from multiple clinical centers, GraphEcho achieved notable improvements over existing UDA techniques, including enhanced Dice scores on benchmark datasets CAMUS and Echonet Dynamic. These results demonstrate the effectiveness of GraphEcho’s graph-matching approach in adapting models to unlabeled target domains while maintaining performance close to models trained directly on target data.
Further advancements in unsupervised cardiac segmentation include the work by Dou et al. [15], which focused on adapting unsupervised techniques across different imaging modalities, specifically on point clouds and multi-modal cardiac MR images. Their method employed entropy minimization to align features between different modalities, achieving a Dice coefficient of 0.89 for myocardial segmentation. Additionally, a study by Kalra et al. [16] on myocardial segmentation in cardiac MR images also adopted an unsupervised approach, demonstrating robustness to noise and achieving Dice scores up to 0.91 when compared to manual annotations. Neither study employed echocardiograms, focusing instead on MR and other imaging modalities. These studies highlight the potential of unsupervised techniques in cardiology, particularly in overcoming the challenge of limited labeled datasets, and demonstrate comparable performance to supervised methods in terms of segmentation accuracy but using images with higher quality when compared to echocardiograms.

2.5. Challenges and Limitations

Despite significant advances, several challenges remain in applying deep learning to echocardiogram analysis and cardiac image segmentation. Deep learning models typically require large amounts of labeled data for training, which is scarce in medical domains. Techniques such as data augmentation and methods that can learn from smaller datasets are crucial to overcoming this limitation [8]. In addition, although deep learning models can achieve high accuracy, their decision-making process is often difficult to interpret, posing a barrier to clinical acceptance. Researchers are exploring ways to develop more interpretable deep learning models [4,8]. Furthermore, models trained on one dataset may not generalize well to data from different scanners, acquisition protocols, or patient populations. Ensuring robustness and generalizability is essential for real-world clinical deployment.
In summary, deep learning is transforming how we analyze echocardiograms and segment cardiac images, with applications ranging from automated view classification to disease detection and quantitative cardiac function assessment. However, challenges related to data availability, model interpretability, and generalization must be addressed to fully realize the potential of deep learning in clinical practice. This work directly addresses these challenges by leveraging an unsupervised approach to segmentation, enhancing model generalization, and reducing reliance on large labeled datasets, thus contributing to the broader applicability of deep learning in cardiovascular diagnosis, treatment, and patient management.

3. Methods

The proposed methodology for unsupervised image segmentation in echocardiography leverages the U-Net architecture to segment 2D echocardiogram images. This approach addresses the challenges associated with the variability and complexity inherent in medical images, particularly in the absence of annotated training data. The details of the method design are described as follows.

3.1. Architecture

The network architecture is shown in Figure 1, with the core segmentation process built on the U-Net architecture, chosen for its effectiveness in medical image segmentation tasks. The encoder–decoder structure, coupled with skip connections, allows it to preserve spatial information and provide pixel-wise predictions, making it well-suited for segmenting the complex structures in echocardiographic images. This design facilitates precise segmentation without relying on predefined features, which is crucial for capturing fine anatomical details [4,6,22,23].
Convolutions are a fundamental operation in deep learning models, especially for image processing tasks like segmentation. A convolutional layer applies a set of learnable filters (kernels) to the input image. These kernels slide across the image, performing element-wise multiplications with the pixel values and producing a feature map. Each kernel helps in detecting specific patterns like edges, textures, or other spatial features of the image. In our U-Net architecture, the size of each kernel is 3 × 3, meaning that each convolutional layer applies a small window of 3 × 3 pixels at a time to capture features in the image. These convolutions maintain the spatial dimensions of the image due to padding. Each convolution is followed by a ReLU (Rectified Linear Unit) activation function, which introduces non-linearity to the model by replacing negative values with zero. The ReLU function ensures that the network learns complex patterns by allowing the model to retain only positive values, improving the speed of convergence during training. Additionally, the network architecture is designed to have the following components:
  • Contracting Path (Down-sampling): This path reduces the spatial dimensions of the input while capturing high-level features. Each stage applies two 3 × 3 convolutions followed by ReLU activation, then down-samples with 2 × 2 max pooling. This reduces resolution but increases the number of channels, capturing more complex features.
  • Bottleneck (Middle Layer): The bottleneck captures the most abstract features, using two 3 × 3 convolutions with ReLU activation. It processes the reduced feature map before the expanding path begins, preserving the essential information needed for reconstruction.
  • Expanding Path (Up-sampling): The expanding path up-samples the feature maps with transposed convolutions, restoring spatial resolution. Skip connections from the contracting path help retain fine-grained details, ensuring the final output preserves spatial information.
  • Final Layer: A 1 × 1 convolution reduces the feature maps to the required number of output channels, generating a segmentation mask with the same size as the input image. Softmax activation provides class probabilities for each pixel.
  • Reconstruction Layer: The reconstruction layer outputs a 3-channel image using a 1 × 1 convolution, helping to regularize the training process by preserving sufficient spatial information.
The U-Net architecture efficiently combines local details from the down-sampling path with global context from the up-sampling path through its skip connections. The use of 3 × 3 kernels in the convolutions allows the model to focus on small, local regions of the image, while the max pooling and up-sampling operations handle the overall structure. ReLU activation after each convolution ensures that the model learns non-linear patterns that are critical for the accurate segmentation of medical images.
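For illustration, the sketch below shows how the described encoder–decoder pattern (paired 3 × 3 convolutions with ReLU, 2 × 2 max pooling, transposed-convolution up-sampling with skip connections, and 1 × 1 segmentation and reconstruction heads) could be expressed in PyTorch. Layer counts, channel widths, and names are illustrative only and are much smaller than the actual network used in this work; the softmax over class channels is assumed to be applied downstream in the loss.

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions (padding preserves spatial size), each followed by ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class SmallUNet(nn.Module):
    """Minimal U-Net-style sketch with a segmentation head and a reconstruction head."""
    def __init__(self, in_ch=3, n_classes=8):
        super().__init__()
        self.enc1 = double_conv(in_ch, 16)        # contracting path
        self.enc2 = double_conv(16, 32)
        self.pool = nn.MaxPool2d(2)               # 2x2 max pooling
        self.bottleneck = double_conv(32, 64)     # most abstract features
        self.up2 = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec2 = double_conv(64, 32)           # 64 = 32 (skip) + 32 (up-sampled)
        self.up1 = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        self.dec1 = double_conv(32, 16)
        self.seg_head = nn.Conv2d(16, n_classes, kernel_size=1)  # per-pixel class logits
        self.rec_head = nn.Conv2d(16, 3, kernel_size=1)          # 3-channel reconstruction

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.seg_head(d1), self.rec_head(d1)

# Example: a batch of 4 frames at 224x224 (gray-scale replicated to 3 channels).
logits, reconstruction = SmallUNet()(torch.randn(4, 3, 224, 224))
```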

3.2. Loss Functions

To train the network effectively, a composite loss function is used, which combines the following components:
  • Reconstruction Loss: Ensures that the network accurately reproduces the input image by minimizing the Mean Squared Error (MSE) between the input and the reconstructed output.
    L_{rec} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{x}_i)^2
    where $x_i$ is the original input image and $\hat{x}_i$ is the reconstructed image.
  • Contour Regularization Loss: This loss penalizes sharp variations along the boundaries in the segmentation mask, encouraging smoother predictions. It computes the difference between the maximum and minimum values of neighboring pixels within a given radius to enforce smoothness [22]. In our approach, we used a radius of 3 pixels.
    L_{ContReg} = \max_{n_1, n_2 \in W_d(n)} \lVert p_{n_1} - p_{n_2} \rVert^2 = \lVert M_d^{+}(p) - M_d^{-}(p) \rVert^2
    where $p_{n_1}$ and $p_{n_2}$ are the predicted probabilities at neighboring positions within a window $W_d(n)$, and $M_d^{+}(p)$ and $M_d^{-}(p)$ denote the max-pooled and min-pooled probabilities within that window.
  • Similarity Loss: To measure the similarity within the predicted segmentation mask, we use a cross-entropy-based similarity loss. This loss component enhances the similarity of features within the same cluster and differentiates features from different clusters [6], defined as:
    L_{sim}(\{\hat{r}_n, c_n\}) = -\sum_{n=1}^{N} \sum_{i=1}^{q} \delta(i - c_n) \ln \hat{r}_{n,i}
    where $\hat{r}_{n,i}$ is the predicted probability of the n-th sample belonging to cluster i, $c_n$ is the true cluster label for the n-th sample, $\delta(t)$ is the Kronecker delta function ($\delta(t) = 1$ if $t = 0$ and $\delta(t) = 0$ otherwise), N is the total number of samples, and q is the number of clusters.
The overall loss function is a weighted sum of the individual components. Specifically, the weights used in the code are $\alpha = 0.15$, $\beta = 0.32$, and $\gamma = 0.53$ for the reconstruction loss, contour regularization loss, and similarity loss, respectively. These weights were selected through a grid search in which the value of each component varied from 0 to 0.9 (increments of 0.1), with the sum of the weights always constrained to 1, which reduced the number of combinations. Once the best initial set of weights was identified, we further fine-tuned them (increments of 0.01) using the same grid search approach to maximize performance. The final loss is computed by (a minimal code sketch of this composite objective follows the equation):
L_{total} = \alpha \cdot L_{rec} + \beta \cdot L_{ContReg} + \gamma \cdot L_{sim}
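The sketch below shows one way this composite objective could be assembled in PyTorch. The max-pool/min-pool trick stands in for the contour regularization window, and the similarity term is written as a cross-entropy against the current argmax pseudo-labels, as in differentiable feature clustering; variable names, the windowing details, and the pseudo-label step are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

ALPHA, BETA, GAMMA = 0.15, 0.32, 0.53  # weights obtained from the grid search

def total_loss(seg_logits, reconstruction, image, radius=3):
    # Reconstruction loss: MSE between the input frame and the reconstructed output.
    l_rec = F.mse_loss(reconstruction, image)

    # Contour regularization: squared difference between the maximum and minimum
    # class probabilities within a (2*radius + 1) neighborhood, encouraging smooth masks.
    probs = torch.softmax(seg_logits, dim=1)
    k = 2 * radius + 1
    p_max = F.max_pool2d(probs, kernel_size=k, stride=1, padding=radius)
    p_min = -F.max_pool2d(-probs, kernel_size=k, stride=1, padding=radius)
    l_cont = ((p_max - p_min) ** 2).mean()

    # Similarity loss: cross-entropy between the predictions and their own argmax
    # cluster assignments (pseudo-labels), sharpening within-cluster agreement.
    pseudo_labels = seg_logits.argmax(dim=1)
    l_sim = F.cross_entropy(seg_logits, pseudo_labels)

    return ALPHA * l_rec + BETA * l_cont + GAMMA * l_sim
```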

3.3. Training Process

Pre-Processing the Network Inputs

The echocardiogram video dataset is first pre-processed to convert it into a format suitable for input into the deep learning model. The following steps describe this process:
  • Frame Selection: A subset of frames is extracted from each video sequence, ensuring they provide enough information for segmentation (frames in different phases of the cardiac cycle, including EDV and ESV). Since consecutive cardiac cycles within the same video exhibit minimal differences, increasing the number of frames significantly lengthens processing time without notable performance improvements. Therefore, selecting an appropriate number of frames that cover at least one complete cardiac cycle per video is sufficient for the training phase.
  • Pre-processing for Training: The selected frames are first resized to a resolution of 224 × 224 to ensure uniform input size for the model. A series of transformations is applied to augment the data, including the random application of Gaussian blur with varying kernel sizes (3, 5, and 7) and sigma values (ranging from 0.1 to 2.0), as well as random rotations of up to 5 degrees. These transformations are applied with a probability of 0.5, adding variability to the training data to make the model more robust to noise and orientation changes. Finally, the frames are normalized using the mean and standard deviation values specific to the dataset, ensuring that pixel intensities are standardized for improved model performance (a minimal sketch of this augmentation pipeline is shown after this list).
  • Pre-processing for Validation and Testing: For validation and testing, only resizing to 224 × 224 and tensor conversion are performed, without any additional augmentations, ensuring consistency in evaluation.
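The augmentation described above can be expressed with torchvision transforms roughly as follows. This is a minimal sketch under stated assumptions: the normalization statistics shown are placeholders for the dataset-specific values, and the exact ordering of operations in the actual pipeline may differ.

```python
import torchvision.transforms as T

# Placeholder normalization statistics; the real values are computed from the dataset.
MEAN, STD = [0.13, 0.13, 0.13], [0.20, 0.20, 0.20]

# Gaussian blur with a kernel size of 3, 5, or 7 and sigma sampled in [0.1, 2.0].
blur = T.RandomChoice([T.GaussianBlur(k, sigma=(0.1, 2.0)) for k in (3, 5, 7)])

train_transform = T.Compose([
    T.Resize((224, 224)),
    T.RandomApply([blur], p=0.5),                         # blur applied with probability 0.5
    T.RandomApply([T.RandomRotation(degrees=5)], p=0.5),  # rotations of up to 5 degrees
    T.ToTensor(),
    T.Normalize(mean=MEAN, std=STD),
])

# Validation/testing: resizing and tensor conversion only, no augmentation.
eval_transform = T.Compose([T.Resize((224, 224)), T.ToTensor()])
```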
To train, validate, and test our model, we used the EchoNet-Dynamic Dataset [7], with a total of 7465 training samples, 1288 validation samples, and 1277 test samples. Instead of using all frames from the videos, we opted to use 16 frames per video, which significantly improved both the training time and model performance. This decision was based on the comparison of models using 4, 8, and 32 frames, where the loss did not show significant improvement with either fewer or more frames (Figure A2). Visual inspection further confirmed that the results with 16 frames provided more satisfactory segmentation, capturing key cardiac cycles effectively. Additionally, using 16 frames allowed us to perform a wider grid search for the optimal hyperparameters, as the training time per epoch averaged around 1 min (whereas using 32 frames would exceed 20 min per epoch), making it more efficient than using 32 frames or more. This approach aligns with the findings in [7], which highlighted the impact of video length on model efficiency. To further enhance the quality of the data and minimize the number of samples used for training, we implemented a pre-processing pipeline that scored frames based on sharpness and contrast [18], utilizing techniques such as Contrast Limited Adaptive Histogram Equalization (CLAHE) for contrast enhancement and Laplacian variance for sharpness scoring. This allowed us to refine the dataset, before starting the actual pre-processing, by selecting the highest-quality videos for training, resulting in shorter training times and improved overall model performance.
We assessed video quality using a range from 0 to 1, selecting only those videos with an average score of 0.8 or higher. As a result, out of 7465 training videos, 628 were selected based on this quality criterion. For the validation set, 118 out of 1288 videos met the threshold, and for the test set, 110 out of 1277 videos were selected. By focusing on these high-quality videos, we were able to ensure better training efficiency and performance [7].
The performance of CLAHE and Laplacian variance was evaluated through the calculation of a quality score for each video based on two key metrics: sharpness and contrast. Sharpness was assessed using Laplacian variance, which quantifies the amount of high-frequency content in the image, effectively capturing the sharpness of the frames—higher variance indicates sharper frames. Contrast was measured using the standard deviation of pixel intensities, where a higher standard deviation suggests greater contrast between pixel values, representing clearer differentiation of anatomical structures. Before applying these metrics, each frame was pre-processed using CLAHE to enhance local contrast. The final quality score for each frame was computed as a weighted sum of normalized sharpness (with a weight of 0.3) and normalized contrast (with a weight of 0.7), reflecting the greater importance of contrast in this dataset. The overall score for each video was calculated as the average score across all frames, and only videos with an average score of 0.8 or higher were selected for use in training, validation, and testing. Through visual inspection, it was clear that videos with higher scores consistently had better chamber delineation, further confirming the effectiveness of the scoring method. This process allowed us to prioritize high-quality echocardiogram frames, improving model performance by focusing on clearer images [18]. The distribution of scores is presented in Figure 2, and examples of selected and non-selected videos are provided in the appendix for comparison.
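A condensed sketch of this frame-quality scoring (CLAHE pre-processing, Laplacian-variance sharpness, intensity-standard-deviation contrast, and the 0.3/0.7 weighted combination) is shown below using OpenCV. The CLAHE parameters and the normalization constants used to map sharpness and contrast into [0, 1] are illustrative assumptions, not the exact values used in this work.

```python
import cv2
import numpy as np

def frame_quality(gray_frame, max_sharp=1000.0, max_contrast=80.0):
    """Score a uint8 gray-scale frame in [0, 1] from sharpness and contrast.
    max_sharp and max_contrast are illustrative normalization constants."""
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray_frame)                      # local contrast enhancement
    sharpness = cv2.Laplacian(enhanced, cv2.CV_64F).var()   # high-frequency content
    contrast = enhanced.std()                               # spread of pixel intensities
    sharp_n = min(sharpness / max_sharp, 1.0)
    contrast_n = min(contrast / max_contrast, 1.0)
    return 0.3 * sharp_n + 0.7 * contrast_n                 # contrast weighted more heavily

def video_quality(frames):
    # Average frame score; only videos scoring >= 0.8 are kept for training/validation/testing.
    return float(np.mean([frame_quality(f) for f in frames]))
```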
The network was trained using the Adam optimizer [24], with an initial learning rate of 0.001. A learning rate scheduler was applied to reduce the learning rate by a factor of 0.1 every 16 epochs. The training was conducted for a total of 32 epochs. To validate the selection of these hyperparameters, a grid search was employed to systematically explore a broad range of values for learning rate, epochs, and weight decay. The grid search began by testing epoch values of 16, 32, 64, and 128, while the learning rate was tested across five increments, starting from $10^{-1}$ and $10^{-2}$ and continuing in decreasing powers of ten. The search was refined iteratively, narrowing down to smaller intervals based on the observed loss values during training. The choice of 32 epochs was made to balance a relatively short training time, given the complexity of the model, against model performance. Observations from the grid search showed that after 32 epochs, there were no significant improvements in validation loss for most configurations, allowing us to fix the epoch number and focus on optimizing the other parameters. Thus, 32 epochs offered an efficient training duration without compromising accuracy. The final configuration was chosen by selecting the combination that minimized the validation loss consistently across the 32 epochs.
During training, a batch size of 4 videos, with 16 frames per video, was used. The model was trained on a system equipped with the following configuration: Intel Core i7-12700KF processor, 32 GB of RAM, and an NVIDIA GeForce RTX 3080 GPU.
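For reference, the optimizer, schedule, and batching described above correspond roughly to the following PyTorch setup. This is a sketch only: SmallUNet and total_loss refer to the illustrative sketches given earlier, and the random tensors stand in for the pre-processed EchoNet-Dynamic frames.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.optim.lr_scheduler import StepLR

# Dummy stand-in data: 8 videos of 16 frames at 224x224 (real data come from EchoNet-Dynamic).
videos = torch.randn(8, 16, 3, 224, 224)
train_loader = DataLoader(TensorDataset(videos), batch_size=4, shuffle=True)

model = SmallUNet()                                     # illustrative network from Section 3.1
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = StepLR(optimizer, step_size=16, gamma=0.1)  # reduce LR by 10x every 16 epochs

for epoch in range(32):                                 # 32 training epochs
    for (batch,) in train_loader:                       # 4 videos x 16 frames per batch
        frames = batch.flatten(0, 1)                    # treat frames as a batch of images
        logits, reconstruction = model(frames)
        loss = total_loss(logits, reconstruction, frames)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```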
The model with the lowest validation loss was saved at epoch 32. The weights corresponding to this model were selected based on the validation loss observed during training, ensuring that the model was saved at the point of its best performance. As shown in Figure 3, the behavior of the loss function throughout training indicates that the model effectively learned the objectives of the loss functions without overfitting, demonstrating stable and consistent learning.

3.4. Post-Processing Through 3D Watershed Segmentation

After obtaining the initial segmentation from the U-Net model, we apply a post-processing step using the 3D Watershed algorithm to isolate the left ventricle in the echocardiogram images. This step is essential because the U-Net output includes both the heart walls and chambers, necessitating further refinement to focus solely on the left ventricle.
Additionally, due to the nature of the similarity loss function, the number of segmentation masks tends to decrease as the training progresses. This reduction was observed during our testing, as shown in Figure 4. While this effect demonstrates promising segmentation of the heart walls, it necessitates additional processing to accurately extract the chamber volumes, such as the left ventricle, for instance-specific measurements.
The Watershed3D algorithm is particularly suited for this task due to its ability to handle complex 3D structures and provide precise segmentation boundaries. Unlike traditional contour-based methods, such as active contours or snakes, Watershed3D excels in environments with significant noise and overlapping structures, which are common in echocardiographic images. The algorithm works by treating the intensity values of the image as a topographical surface and simulates the flooding of this surface from local minima. The points where the floods meet define the segmentation boundaries [25].
The key steps in the 3D Watershed algorithm for isolating the left ventricle from the echocardiographic images are as follows:
  • Gradient Computation: The first step involves computing a gradient of the image to highlight regions of rapid intensity changes, which will serve as potential boundaries. This gradient map is combined with a positional bias that accounts for the spatial position of the pixels within the image, ensuring that the algorithm can handle complex 3D structures in a volumetric image.
  • Thresholding and Mask Creation: A binary mask is generated for each frame using Otsu’s thresholding technique, which automatically separates the foreground and background by selecting an optimal threshold that minimizes the variance within each group. This technique analyzes the histogram of pixel intensities in the gray-scale image, determining a threshold that best distinguishes between the two regions. After applying the threshold, all pixels below the threshold are classified as background, and those above as foreground. The mask is then inverted, as we are interested in using the background areas to initiate the watershed flooding process.
  • Positional Bias and Distance Transform: In addition to the standard distance transform, a positional bias is introduced. This bias is weighted across axes as follows: 0.4 for the z-axis, 0.3 for the y-axis, and 0.3 for the x-axis. This helps prioritize certain axes when calculating the distances, leading to improved segmentation accuracy for 3D data.
    D(x) = \min_{y \in B} \lVert x - y \rVert
    where $D(x)$ represents the distance transform at point $x$ of the non-background structures, and $y$ is a point in the background $B$. The Watershed algorithm then operates on the negative of this distance map, effectively segmenting regions based on local minima.
  • Marker-Based Segmentation: To control the flooding process, we use markers that identify regions of interest such as the left ventricle. These markers are derived from the U-Net’s initial segmentation output and refined with the computed distance transform.
  • Watershed Transformation: The Watershed transformation is then applied, segmenting the regions based on the previously computed markers and gradient map. The segmentation is constrained to a maximum number of classes, and labels are assigned to each segmented region.
  • Label Assignment and Visualization: The segmented regions are labeled, and each class is assigned a distinct color using a persistent color map for visualization. This step ensures that the left ventricle is isolated for further analysis, and the colors remain consistent across frames. However, due to the nature of the algorithm, it is challenging to consistently assign the same color/label to each chamber across different videos. This variability introduced the need for on-the-fly selection of the left ventricle for subsequent mask isolation and volume calculations.
The 3D Watershed algorithm was initialized by first using a threshold to isolate low-intensity regions, which typically represent the background. To reduce noise, morphological operations such as dilation and closing were applied. A distance transform was then computed, with a positional bias added to account for spatial orientation across the z, y, and x axes. Local maxima were detected in the distance-transformed volume to serve as markers for the watershed algorithm, which then segmented the volume. Consistent class labels were maintained across frames by calculating centroids for each label and matching them to reference centroids from the first frame. This approach ensures spatial coherence across the video sequence. The process can be observed in Figure 5.
This approach was a necessary step over directly using the U-Net output because it provides more robust handling of complex regions of interest, such as the heart chambers, as shown in Figure 6. Additionally, the Watershed algorithm allows for precise control over the segmentation process through marker-based initialization, which is crucial for accurately isolating the left ventricle. By using the distance transform and introducing a positional bias, the algorithm effectively separates the chambers based on the difference in distances across the frames.
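A simplified, single-frame sketch of this marker-based watershed refinement using scikit-image and SciPy is shown below. The positional-bias weighting across the z, y, and x axes and the frame-stacked 3D handling used in the actual pipeline are omitted for brevity, and the parameter values are assumptions.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.filters import threshold_otsu
from skimage.feature import peak_local_max
from skimage.segmentation import watershed
from skimage.morphology import binary_closing, binary_dilation

def watershed_chambers(frame, min_distance=20):
    """Split a gray-scale echo frame into candidate chamber regions."""
    # Otsu threshold: dark regions (chambers/background) vs. brighter heart walls.
    mask = frame < threshold_otsu(frame)
    mask = binary_closing(binary_dilation(mask))          # reduce speckle noise

    # Distance to the nearest wall pixel; chamber centers appear as local maxima.
    distance = ndi.distance_transform_edt(mask)
    labeled_mask, _ = ndi.label(mask)
    peaks = peak_local_max(distance, min_distance=min_distance, labels=labeled_mask)
    markers = np.zeros(frame.shape, dtype=int)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)

    # Flood the negative distance map from the markers, restricted to the mask.
    return watershed(-distance, markers, mask=mask)
```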

3.5. Performance Evaluation

In this section, we describe the steps taken to evaluate our network’s performance, including the methodology used for segmentation and ejection fraction (EF) calculation, as well as the process of extracting and evaluating segmentation masks from the ground truth data.

3.5.1. Visual Inspection of Segmentation Results

To begin the evaluation of our network’s output, we performed a visual inspection of the segmentation results during training, particularly focusing on challenging echocardiogram cases. These cases included images with significant noise, poor imaging angles, or frames where the heart was shifted laterally. This visual inspection allowed us to qualitatively assess the network’s performance in handling these difficult scenarios and make real-time adjustments to the training parameters where necessary.

3.5.2. Quantitative Evaluation and Mask Extraction

Following the visual inspection, after the model was fully trained, we performed a quantitative evaluation using the segmented left ventricles. With the segmented left ventricle, we computed the Dice coefficient and Intersection over Union (IoU) to assess the segmentation accuracy, comparing our method against the ground truth and bench-marking it with the EchoNet-Dynamic model.
The ground truth (GT) masks were generated through a semi-automated process based on expert annotations of the heart chambers at various points in the cardiac cycle. These tracings, representing the left ventricle’s endocardial borders, were used to create binary masks by filling the contours of the traced polygons. This process allowed for accurate delineation of the ventricular boundaries based on expert input and followed the same approach used by Ouyang et al. [7].
In contrast, the masks produced by the EchoNet Dynamic model were extracted from its predicted segmentation output. Specifically, the left ventricular mask was represented by a red pixel value (R = 255, G = 1, B = 1) in the top-right quadrant of the predicted video. The extraction process involved isolating this region and applying morphological operations to fill any gaps. The results obtained through this mask extraction were highly consistent with those reported in the original EchoNet Dynamic paper [7], confirming that our extraction method accurately replicates the model’s intended performance.
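A short OpenCV sketch of this extraction step is given below: the top-right quadrant is cropped, pixels matching the pure-red overlay are selected, and a morphological closing fills small gaps. The kernel size is an illustrative assumption.

```python
import cv2
import numpy as np

def extract_echonet_lv_mask(frame_bgr: np.ndarray) -> np.ndarray:
    """Isolate the left-ventricle overlay from an EchoNet-Dynamic output frame.
    The overlay is drawn in pure red (R=255, G=1, B=1) in the top-right quadrant."""
    h, w = frame_bgr.shape[:2]
    quadrant = frame_bgr[: h // 2, w // 2:]                  # top-right quadrant of the video
    b, g, r = cv2.split(quadrant)
    mask = ((r == 255) & (g == 1) & (b == 1)).astype(np.uint8)
    kernel = np.ones((5, 5), np.uint8)                       # illustrative kernel size
    return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)   # fill small gaps in the mask
```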
In addition to segmentation accuracy, we calculated the ejection fraction (EF) from the segmented volumes of the left ventricle and compared it to the ground truth EF values. To ensure a fair comparison, we used the same test set (filtered by our quality assessment criteria) when comparing with the EchoNet model. This approach resulted in minor differences in the final results, as EchoNet’s calculations averaged EF over multiple frames, whereas we calculated EF using only two frames: one in the diastolic position and the other in the systolic position, as specified in the ground truth.
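Because EF is a ratio, it can be computed directly from uncalibrated, pixel-based volume proxies of the two selected frames. The following minimal sketch assumes pixel counts of the diastolic and systolic left-ventricle masks serve as those proxies, consistent with the absence of a geometric conversion factor noted later in Section 4.1.

```python
import numpy as np

def ejection_fraction(diastolic_mask: np.ndarray, systolic_mask: np.ndarray) -> float:
    """EF (%) from binary left-ventricle masks at end-diastole and end-systole.
    Pixel counts act as uncalibrated volume proxies; EF is unaffected because it is a ratio."""
    edv = float(diastolic_mask.sum())   # end-diastolic "volume" in pixels
    esv = float(systolic_mask.sum())    # end-systolic "volume" in pixels
    return (edv - esv) / edv * 100.0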
The Dice and IoU values are calculated using the following formulas:
  • Dice Coefficient (Dice) is given by:
    \mathrm{Dice} = \frac{2 \cdot |A \cap B|}{|A| + |B|}
    where A represents the predicted segmentation and B corresponds to the ground truth. The Dice coefficient measures the overlap between the predicted and ground truth segments, with a value of 1 (100%) indicating perfect overlap.
  • Intersection over Union (IoU), also known as the Jaccard Index, is defined as:
    \mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}
    IoU measures the ratio of the intersection to the union of the predicted and actual segments, providing a stricter evaluation than Dice (a short computational sketch of both metrics follows this list).
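Both overlap metrics reduce to a few NumPy operations on binary masks, as in the following sketch:

```python
import numpy as np

def dice_and_iou(pred: np.ndarray, truth: np.ndarray):
    """Dice coefficient and IoU for binary segmentation masks A (pred) and B (truth)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    dice = 2.0 * intersection / (pred.sum() + truth.sum())
    iou = intersection / union
    return dice, iou
```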
For evaluating EF, we computed the Mean Absolute Error (MAE), the Mean Absolute Percentage Error (MAPE), and the Root Mean Squared Error (RMSE) between the EF estimated by our method and the ground truth as:
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
where $y_i$ is the ground truth EF value, and $\hat{y}_i$ is the predicted EF value. MAE measures the average magnitude of the errors in a set of predictions, without considering the direction of the errors (whether they are over- or under-estimated). Lower MAE values indicate better model accuracy. MAPE is similar to MAE but expresses the error as a percentage of the true value, making it scale-independent and easier to interpret across different ranges of EF values. This is particularly useful when comparing predictions across different patients or EF ranges, as it allows for a direct understanding of the relative size of the error. MAPE is a preferred metric when comparing models across datasets or when the magnitude of the values varies significantly. However, it is important to note that MAPE can be sensitive to small values of $y_i$, where errors may disproportionately inflate the percentage. RMSE is more sensitive to large errors than MAE because the differences are squared before averaging, thus penalizing larger errors more heavily. This makes RMSE useful in situations where larger deviations from the true EF values are particularly undesirable, such as in clinical assessments where significant over- or under-predictions could have serious consequences. RMSE provides a general sense of the model’s accuracy and its robustness in minimizing large errors.
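For completeness, the three EF error metrics above reduce to a few NumPy operations:

```python
import numpy as np

def ef_errors(y_true: np.ndarray, y_pred: np.ndarray):
    """MAE, MAPE (%), and RMSE between ground-truth and predicted EF values."""
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    mape = 100.0 * np.mean(np.abs(err / y_true))   # sensitive to small y_true values
    rmse = np.sqrt(np.mean(err ** 2))
    return mae, mape, rmse
```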

4. Results

4.1. Quantitative Analysis

In this section, we present the quantitative results of our model’s performance, focusing on key metrics that evaluate the segmentation accuracy and the EF prediction. By comparing our unsupervised method with the EchoNet Dynamic supervised model, we aim to provide a comprehensive assessment of our model’s performance in both segmentation and EF prediction tasks. Instead of directly comparing volumes (EDV and ESV), we evaluated the EF values as we do not have the conversion factor from pixel measurements to standard units of measurement. EF, being a ratio, does not require unit conversion, allowing us to compare the results across different methods without the need for geometric calibration.

4.1.1. Segmentation Accuracy

The segmentation accuracy was evaluated using the Dice coefficient and IoU. These metrics were applied to compare our model’s segmentation against the ground truth as well as the EchoNet Dynamic model. As previously discussed, the Dice coefficient measures the overlap between predicted and ground truth segmentations, while IoU offers a stricter comparison by assessing the intersection relative to the union of the predicted and true segmentation.
The results from the segmentation metrics, shown in Table 1, indicate that our method achieved a Dice coefficient of 0.8225 (95% CI: [0.8098, 0.8353]) and an IoU score of 0.7071 (95% CI: [0.6928, 0.7213]) when compared to the ground truth. In comparison, the EchoNet Dynamic model reported a higher Dice coefficient of 0.9241 (95% CI: [0.9207, 0.9275]) and an IoU of 0.8600 (95% CI: [0.8542, 0.8657]). EchoNet’s higher scores can likely be attributed to the fact that it leverages extensive expert-annotated data (segmentation masks created by expert sonographers) and a more complex architecture. However, this small discrepancy emphasizes the potential of an unsupervised approach, especially when fine-tuned or combined with more sophisticated post-processing techniques or data acquisition methods. The confidence intervals demonstrate the stability of our model’s performance across multiple tests, further confirming the robustness of our unsupervised segmentation approach.
Additionally, our model demonstrates satisfactory accuracy, particularly when capturing the edges of the left ventricle and the movement of the mitral valve. Figure 7 shows a visual comparison between the segmentation results produced by our model, the ground truth, and the EchoNet Dynamic model. The figure illustrates how closely our model follows the heart wall, which makes our masks slightly larger than the others, and highlights the remaining discrepancies.

4.1.2. Ejection Fraction

To evaluate the accuracy of the predicted EF values, we calculated the MAE, MAPE, and RMSE between the predicted and ground truth EF values. These metrics were selected to provide a comprehensive view of the prediction accuracy, with each metric emphasizing different aspects of error (absolute, percentage, and squared errors). Table 2 summarizes the comparison of EF prediction accuracy between our method and the EchoNet Dynamic model.
In the evaluation of EF prediction (Table 2), our method demonstrated competitive performance, with an MAE of 13.7057 and an RMSE of 16.9375, compared to the EchoNet Dynamic model, which reported an MAE of 13.1116 and an RMSE of 18.5931. Although EchoNet achieved a slightly lower MAE, indicating marginally better overall accuracy in EF prediction, our model exhibited a lower RMSE, suggesting that it produced fewer large errors in individual cases. The MAPE also reflects this balance, where our method had an error of 28.01%, while EchoNet achieved 25.38%. This comparison suggests that while both models are capable of accurately predicting EF values, our method may perform more consistently across different cases, even if it produces slightly larger errors in some instances. Given that EF is a critical clinical metric for assessing cardiac function, the small differences in these results highlight the robustness and potential clinical relevance of our approach.
One of the primary sources of error in our model’s segmentation output is the overestimation of the left ventricle’s area as the model approaches the edges of the heart wall. This is a consequence of the pixel-level precision of our unsupervised segmentation method, which can result in slight overshooting of boundaries, particularly in regions where the heart wall is less distinct. Additionally, the natural movement of the heart during the cardiac cycle, including the motion of the valves, is also captured in our segmentation output. This differs from methods that rely on annotated points to trace an elliptical shape, which typically exclude such anatomical features. As a result, our model’s segmentation of the left ventricle sometimes includes the valves and slight imperfections in the heart wall, leading to a more irregular shape compared to the smoother, idealized boundary defined by annotations. Furthermore, shades of the heart wall or the back wall of the heart, visible within the chamber, introduce additional variance in the segmentation output. These shaded areas can cause the model to misinterpret noise as part of the anatomical structure, leading to minor inaccuracies. Such issues are common in echocardiographic image analysis due to the inherent variability in ultrasound image quality.
Despite these sources of variance, the errors in segmentation are well within the typical range of measurement variation observed between clinicians. Inter-observer variation in echocardiographic measurements has been reported to be as high as 13.9% [7], with slight deviations in segmentation boundaries being relatively common. For instance, EchoNet-Dynamic, a leading supervised model for echocardiogram analysis, demonstrated an area under the curve of 0.97 when classifying cardiomyopathy based on an ejection fraction threshold of 50%. This highlights the fact that even supervised models trained on large datasets experience similar levels of variance, particularly in edge cases. Our unsupervised approach, while exhibiting occasional segmentation inconsistencies due to image artifacts, remains competitive within this typical range of variation, offering promising results without reliance on annotated data.

4.2. Qualitative Analysis

In addition to quantitative metrics, qualitative analysis was performed to visually assess the model’s segmentation outputs, especially in challenging echocardiogram frames. Qualitative data play a significant role in evaluating the nature of pixel-level segmentation that unsupervised learning techniques tend to produce. Our network segments the heart wall with precise accuracy due to the pixel-wise nature of the loss functions employed during training.
While this approach achieves excellent edge detection for the heart wall, by visual inspection over the full test set, it sometimes struggles with images containing noise or artifacts. In such cases, the network may erroneously interpret noise as part of the heart structure, leading to imperfect segmentation results. The qualitative evaluation confirmed the sensitivity of the model to noise and its potential to produce fine-grained contours in ideal scenarios. This can be observed in Figure 8 and Figure 9.
Figure 8 shows examples of frames where the model performed well, producing segmented images with clearly defined boundaries and accurate representation of the target regions. These examples demonstrate the model’s ability to correctly differentiate between anatomical structures and background, particularly in frames with less noise and clearer features.
On the other hand, Figure 9 highlights cases where the model failed to accurately segment the frames. In these examples, the segmentation results show poor boundary detection, incomplete segmentation, or excessive noise, which leads to misrepresentation of the target areas. These poor performance cases may be due to factors such as low image quality, unclear boundaries, or challenging anatomical structures that the model struggles to distinguish.
This qualitative analysis highlights the strengths and limitations of our approach, emphasizing the need for noise reduction or artifact handling to improve segmentation robustness in challenging scenarios.

5. Conclusions

In this study, we trained and evaluated an unsupervised model for echocardiogram segmentation, focusing on the left ventricle for metric assessments. One of the main challenges of training an unsupervised model lies in the absence of labeled data, but this limitation also serves as an advantage. With the support of quality assessment methods, which ensure the collection of high-quality data even without labels, the model can achieve comparable or even superior performance. This is a significant advantage over supervised methods, as the model can operate in environments where labeled data are limited or unavailable. The flexibility of unsupervised models allows for broader applications, and the results of our model showcase this potential.
The simplicity of our model’s architecture, based on a U-Net with 22 convolutional layers, offers a distinct advantage compared to EchoNet-Dynamic, which employs a DeepLabV3 architecture with a 50-layer residual network for segmentation. Furthermore, EchoNet-Dynamic leverages R(2+1)D spatio-temporal convolutions pre-trained on the Kinetics-400 dataset, while our model was trained from scratch without pre-trained initialization. This highlights a key strength of unsupervised models, which do not rely on pre-existing labeled datasets, yet can still perform well when provided with high-quality training data and a robust architecture.
Another major difference between our model and EchoNet-Dynamic lies in the dataset size. EchoNet-Dynamic was trained on a large dataset of 7465 videos with pre-processed labels and segmentation masks, whereas our model demonstrated competitive results with a significantly smaller dataset. The ability to train effectively with fewer data points is a substantial advantage in settings where acquiring medical data is costly and time-consuming, and where labeled data are even harder to obtain and to annotate precisely.
Our model showed competitive segmentation accuracy and ejection fraction calculation when compared to EchoNet-Dynamic in the same testing environment. The pixel-level precision achieved by our network is particularly useful in applications requiring detailed boundary detection, though it presents challenges in noisy environments where artifacts may cause incorrect segmentations. This underscores the importance of integrating robust data quality assessments during the training process.
Looking forward, there are several paths for improvement and future research. While our unsupervised model performed well, integrating semi-supervised or weakly supervised learning approaches could further enhance its capabilities by leveraging small amounts of labeled data to fine-tune performance. Additionally, expanding the dataset used for training, including more diverse echocardiographic images and patient conditions, could improve generalizability and robustness. Exploring more advanced post-processing techniques could also help reduce the impact of noise and artifacts in real-world clinical settings. By continuing to refine these techniques, unsupervised models can offer increasingly accurate and scalable solutions for echocardiogram analysis, contributing to more efficient and accessible cardiac care.

Author Contributions

Conceptualization, G.F.C. and D.D.; methodology, G.F.C. and D.D.; formal analysis, G.F.C. and D.D.; investigation, G.F.C. and D.D.; writing—original draft preparation, G.F.C.; writing—review and editing, G.F.C., D.D. and N.N.; supervision, D.D. and N.N.; project administration, D.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data used in this study are available from the database of Stanford and can be accessed at https://github.com/echonet/dynamic (accessed on 28 October 2023), and the code for the proposed model can be accessed at https://github.com/gabrielfc7/UISE2D (accessed on 28 October 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. Quality Assessment of Selected vs. Non-Selected Videos

Figure A1. (a) Example of videos not selected after quality assessment. Lower contrast between the chambers and the heart walls, with more noise inside the chambers; (b) example of videos selected after quality assessment. Better sharpness, well-defined edges, and clear chamber visibility.

Appendix A.2. Training and Validation Loss for Different Frame Sizes

The graph in Figure A2 compares the training and validation loss curves for models trained with different frame sizes (4, 8, and 16 frames). Although the four-frame model achieves a smaller loss value, this is attributable to the reduced number of frames used in the computation: even with normalization applied, the variance between frames yields a lower overall loss. All models exhibit similar trends in loss reduction, but the 16-frame model was chosen for further experiments due to its better segmentation performance during validation, optimal loss value at 32 epochs, and reasonable training time of approximately 1 min per epoch. The 32-frame model, although potentially providing more information, required significantly more computation time (over 20 min per epoch), making it less practical for training. Note that 32 epochs were selected because the loss stabilized around epoch 23, and extending training further would likely only increase processing time or risk overfitting the model.

Appendix A.3. Grid Search and Weight Selection

In this work, we performed a grid search to explore various weight configurations for the loss components: reconstruction (rec), contour (contour), and similarity (sim). The goal was to balance the model’s segmentation performance, maintain boundary integrity, and avoid overfitting to noise.
While a few configurations yielded lower validation losses, the selected configuration (rec = 0.15, contour = 0.32, sim = 0.53), shown by the black dashed line in Figure A3, was chosen based on both loss values and visual inspection. Despite having a loss value close to others, this setup produced better results when segmenting the heart wall and handling noisy regions. Lower-loss configurations tended to overfit, capturing noise in the process, whereas the selected configuration provided a more accurate segmentation of the heart wall with better noise reduction.
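For illustration, the snippet below sketches how such a weighted combination of loss terms could be assembled in PyTorch with the selected weights. The individual terms (a mean-squared reconstruction error, a total-variation-style contour penalty, and a pseudo-label cross-entropy standing in for the similarity term) are simplified placeholders and are not the exact objective functions used in this work.

```python
import torch.nn.functional as F

# Weights selected by the grid search (Appendix A.3).
W_REC, W_CONTOUR, W_SIM = 0.15, 0.32, 0.53

def total_loss(reconstruction, original, class_logits):
    """Weighted sum of the three loss components (illustrative only).

    reconstruction, original: image batches of shape (B, 1, H, W).
    class_logits: per-pixel class scores of shape (B, C, H, W).
    """
    # Reconstruction term: how faithfully the decoder reproduces the input.
    rec = F.mse_loss(reconstruction, original)

    # Contour term: total-variation-style penalty encouraging smooth regions
    # separated by sharp boundaries.
    probs = class_logits.softmax(dim=1)
    contour = (probs[:, :, 1:, :] - probs[:, :, :-1, :]).abs().mean() + \
              (probs[:, :, :, 1:] - probs[:, :, :, :-1]).abs().mean()

    # Similarity term: pushes each pixel toward a confident class assignment
    # (cross-entropy against its own argmax pseudo-label).
    sim = F.cross_entropy(class_logits, probs.argmax(dim=1))

    return W_REC * rec + W_CONTOUR * contour + W_SIM * sim
```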
Figure A2. Normalized training and validation loss over epochs for models trained with different frame sizes.
Figure A3. Grid search validation losses for different weight configurations. The selected configuration (black dashed line) balances loss value and segmentation quality.

Appendix A.4. Ablation Study

Ablation studies were conducted to evaluate the influence of the individual components of the loss function: reconstruction (rec), contour (contour), and similarity (sim). By systematically adjusting the weight of each component, both individually and in combination, we aimed to determine the relative contribution of each term to the overall model performance.
As shown in Figure A4 and Figure A5, several configurations resulted in low validation loss values, particularly those with the reconstruction component weighted highly (e.g., rec = 1, cont = 0, sim = 0), which reached a minimum loss value of 0.000026. However, these configurations often failed to segment the images accurately, producing a single color as the segmented output, representing only one class for the entire image, which is a sign of overfitting. An exception was observed when using a combination of two loss components, which resulted in a higher validation loss but still failed due to excessive noise in the segmentation results.
The configuration using rec = 0.15, cont = 0.32, and sim = 0.53 led to a better segmentation result. This configuration leverages the reconstruction component as a regularization term to preserve the overall shape of the structures in the image, while the similarity component groups pixels with similar characteristics. The contour component, in turn, helps define clearer boundaries between regions. As a result, this combination produced more accurate segmentations, with well-defined boundaries and preserved structural integrity in the output, even though it has the highest loss in our ablation test, as shown in Figure A5.
Furthermore, the validation loss behavior of this configuration, as shown in Figure A5, is a strong indication that the model is not overfitting, unlike other combinations that achieve a much lower loss by setting most pixels to the same class and rapidly reducing the loss over time. This balance of components ensures that the model generalizes better and maintains a more precise segmentation without sacrificing performance on unseen data.
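The ablation itself can be summarized as a short loop over weight configurations, training once per configuration and recording the validation loss, with the final choice made jointly on loss value and visual quality. A minimal sketch follows; the helper train_one_config is hypothetical and stands in for the full training procedure.

```python
# Weight configurations explored in the ablation study.
configs = [
    {"rec": 1.0, "cont": 0.0, "sim": 0.0},
    {"rec": 1.0, "cont": 1.0, "sim": 0.0},
    {"rec": 1.0, "cont": 0.0, "sim": 1.0},
    {"rec": 0.0, "cont": 1.0, "sim": 1.0},
    {"rec": 0.15, "cont": 0.32, "sim": 0.53},  # selected configuration
]

results = []
for cfg in configs:
    # Hypothetical helper: trains the network with the given loss weights and
    # returns the best validation loss plus sample masks for visual inspection.
    val_loss, sample_masks = train_one_config(cfg)
    results.append((cfg, val_loss))

# Sorting by loss alone is misleading here: the lowest-loss runs collapsed to a
# single class, so the final choice also relies on inspecting sample_masks.
for cfg, val_loss in sorted(results, key=lambda r: r[1]):
    print(cfg, f"validation loss = {val_loss:.6f}")
```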
Figure A4. Example of segmentation results from models trained with different combinations of loss components. The first result is from a model using reconstruction and similarity losses (rec = 1, cont = 0, sim = 1), showing clearer shapes but losing boundary information. The second result comes from a model using reconstruction and contour losses (rec = 1, cont = 1, sim = 0), which preserves the boundaries better but struggles with internal details and accuracy in the shapes. The third result is from a model using contour and similarity losses (rec = 0, cont = 1, sim = 1), which leads to excessive internal class noise (within the segmented pixels). These results highlight the importance of a balanced configuration that incorporates all three components to maintain both boundary clarity and internal structure.
Figure A5. Validation loss for different ablation configurations. The lower loss values resulted in overfitting and poor segmentation quality, while higher loss configurations provided better image segmentation performance.

Appendix A.5. Comparison with State-of-the-Art Models

In this study, we compared our segmentation model with several state-of-the-art models, including three supervised models (DeepLabV3, Mask R-CNN, and U-Net) as well as the unsupervised W-Net. Each of these models represents a different approach to image segmentation, and the comparison serves to highlight both the strengths and limitations of our unsupervised approach relative to these widely used supervised and unsupervised architectures.
Comparison with Supervised Models: DeepLabV3, Mask R-CNN, and U-Net are state-of-the-art models widely used for image segmentation across many applications. All three are supervised models that require labeled datasets to learn the mapping between input images and the corresponding segmentation masks. These models have shown strong performance in a variety of segmentation tasks, especially in medical imaging, where accurate boundary detection and class differentiation are critical.
  • DeepLabV3: Known for its ability to capture multi-scale contextual information, DeepLabV3 uses atrous convolutions and is particularly effective for semantic segmentation tasks where precise boundary detection is important. The model employs a ResNet-101 backbone and requires pixel-level annotations to achieve high performance [26]. In this study, we used the pre-trained version of DeepLabV3 with ResNet-101 to evaluate its performance on echocardiograms. The results show that, without specific fine-tuning for gray-scale images, it cannot identify any useful information from the echocardiograms (see Figure A6(4)).
  • Mask R-CNN: The Mask R-CNN approach extends Faster R-CNN by adding a segmentation branch to predict pixel-wise masks for each detected object. The model is highly effective for instance segmentation tasks, though it also depends on labeled data to generate accurate segmentation results [27]. It uses a ResNet-50 backbone with a Feature Pyramid Network (FPN) to capture multi-scale features. While Mask R-CNN is effective at object detection and segmentation, its performance on gray-scale echocardiograms (see Figure A6(5)) is limited because it was pre-trained on RGB images and does not account for medical-specific features like heart chambers without further adaptation.
  • U-Net: Widely used in biomedical image segmentation, U-Net employs an encoder–decoder structure and is particularly successful in medical imaging due to its ability to capture both low- and high-level features. Like the other supervised models, it relies on annotated data for training [28]. In this study, we adopted a ResNet-34 encoder pre-trained on ImageNet. Despite its strong track record in medical imaging tasks, like the other models it showed poor segmentation accuracy on gray-scale echocardiogram frames (see Figure A6(6)).
The supervised nature of these models allows them to achieve high accuracy when large annotated datasets are available. However, one challenge is that labeled data can be expensive and time-consuming to obtain, especially in medical imaging tasks, where expert annotations are required. In this study, pre-trained weights were used for all the supervised models to evaluate how well these state-of-the-art architectures generalize to gray-scale echocardiographic data. Using pre-trained models saves time during training, but, as seen in Figure A6, these models struggle to segment echocardiograms without specialized objective functions or further fine-tuning. This highlights the importance of adapting both model architecture and training processes when applying general-purpose segmentation models to domain-specific medical data.
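As a rough illustration of this evaluation setup, the snippet below loads the pre-trained DeepLabV3 (ResNet-101) model from torchvision (version 0.13 or later is assumed) and applies it to a gray-scale frame replicated across three channels; the frame size and pre-processing shown here are assumptions for demonstration, not the scripts used in the study. Mask R-CNN and the U-Net with a ResNet-34 encoder were evaluated in an analogous way.

```python
import torch
from torchvision.models.segmentation import (
    deeplabv3_resnet101,
    DeepLabV3_ResNet101_Weights,
)

# Pre-trained DeepLabV3 with a ResNet-101 backbone (COCO/VOC categories).
weights = DeepLabV3_ResNet101_Weights.DEFAULT
model = deeplabv3_resnet101(weights=weights).eval()
preprocess = weights.transforms()

# Echocardiogram frames are single-channel; the pre-trained model expects RGB,
# so the gray-scale frame is replicated across three channels.
gray_frame = torch.rand(1, 112, 112)        # placeholder frame, (1, H, W)
rgb_frame = gray_frame.repeat(3, 1, 1)      # (3, H, W)

with torch.no_grad():
    logits = model(preprocess(rgb_frame).unsqueeze(0))["out"]

# Per-pixel prediction over the 21 VOC classes; none of them corresponds to
# cardiac structures, which is consistent with the poor results in Figure A6(4).
predicted_mask = logits.argmax(dim=1)
```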
Figure A6. Top row: (1) original image, (2) segmentation using the proposed model, (3) result after applying Watershed algorithm. Bottom row: (4) segmentation using pre-trained DeepLabV3 (ResNet-101), (5) Mask R-CNN (ResNet-50 FPN), and (6) U-Net (ResNet-34 encoder with ImageNet weights). The pre-trained models were used for comparison to highlight the strengths and weaknesses of different architectures on gray-scale echocardiographic data.
Comparison with Unsupervised Models: Unsupervised models like W-Net are particularly useful in situations where labeled data are scarce or unavailable, making them an attractive choice for medical applications. W-Net uses an autoencoder-like structure, with two U-Net-style architectures working in tandem, one for encoding and one for decoding. This design enables W-Net to capture global and local patterns without the need for pixel-level annotations [4]. In this study, we trained the W-Net on our dataset to maximize its performance. An interesting observation from our experiments is that the results were fairly consistent when training the model for between 50 and 150 epochs. Beyond 150 epochs, however, the segmentation quality degraded, often resulting in black images with no useful segmentation information. This indicates a limitation of the W-Net approach, as the model may overfit the data and lose its ability to differentiate meaningful structures. Additionally, as seen in the comparison below, W-Net tends to struggle with fine boundary detection and noise handling, especially when applied to gray-scale echocardiograms. In Figure A7, we observe that W-Net is capable of differentiating the background from the heart structures but struggles to accurately delineate the chambers and handle noise within the image. Moreover, W-Net’s performance is affected by the inherent challenges of unsupervised learning, such as the difficulty of learning fine-grained boundaries and dealing with the complex textures present in medical images. Although unsupervised models like W-Net are promising for scenarios with limited labeled data, they often require additional post-processing steps, such as the Watershed algorithm, to improve segmentation quality.
Figure A7. Top row: (1) original echocardiogram, (2) segmentation using W-Net. Bottom row: (3) original echocardiogram, (4) segmentation using W-Net. While the W-Net is able to differentiate some background areas, it struggles with boundary detection and noise handling, as well as accurately segmenting the heart chambers.
Discussions: While supervised approaches can achieve highly accurate results when trained on labeled data, their dependence on extensive annotated training data poses challenges for medical image analysis research. Direct adoption of pre-trained models has been shown to perform poorly in echocardiogram segmentation. On the other hand, recent unsupervised image segmentation techniques from computer vision show limited effectiveness even after training on a large echocardiogram dataset, largely because they struggle to handle the noise present in gray-scale images. Customized model architectures, tailored learning processes, and effective post-processing steps are essential to improve the accuracy of unsupervised techniques in echocardiogram analysis.
It is important to note that we did not fine-tune any of the supervised models in this experiment due to the lack of extensive annotated data. However, upon data availability, fine-tuning these models could potentially improve their segmentation accuracy, especially in capturing fine details and reducing errors in boundary detection. Future work could focus on exploring semi-supervised techniques to reduce training data requirements and fully leverage the capabilities of these models in echocardiographic segmentation.

Appendix A.6. Extra Results

In this section, we provide additional results to further illustrate the performance of our segmentation algorithm. The first figure shows segmentation results on additional systolic and diastolic frames, while the second figure illustrates the complete algorithm pipeline applied to a sample from our private dataset.
Figure A8. Additional comparison of segmentation results for systolic (top row) and diastolic (bottom row) frames. In this sample, a noticeable chamber deformity and imaging artifacts are present, which pose challenges to accurate segmentation. (a) Original frame, (b) ground truth mask, (c) our model mask, (d) EchoNet mask. The comparison illustrates how each segmentation method handles the deformity and artifacts. Our model shows resilience to these distortions, maintaining clear boundaries, while EchoNet struggles to account for the abnormalities.
The second figure shows the detailed steps of our segmentation algorithm pipeline as applied to a sample from our private dataset. This figure highlights each stage of the process, from pre-processing to final segmentation, allowing a visual understanding of how the algorithm performs on unseen data.
Figure A9. Segmentation algorithm pipeline applied to a sample of the private dataset: (a) Original image, (b) initial segmentation (our model), (c) final segmentation (after Watershed).
These additional results help to showcase the robustness of our segmentation algorithm and its applicability to a dataset from a different population. By applying the model to this private dataset, we further validate its generalization capability across different echocardiographic imaging sources. Furthermore, the pipeline visualization demonstrates the systematic, step-by-step process that improves segmentation outcomes, showing how initial segmentation and the application of the Watershed algorithm contribute to clearer boundaries and more accurate results. This comprehensive analysis underscores the flexibility and effectiveness of our approach, even when applied to datasets with distinct characteristics.
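For readers who wish to reproduce this kind of refinement, a simplified two-dimensional sketch of the post-processing stage is given below using scikit-image. The study applies the Watershed across the frame stack in 3D; the wall class index, structuring-element sizes, and marker threshold used here are illustrative assumptions only.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.measure import label
from skimage.morphology import binary_closing, binary_dilation, disk
from skimage.segmentation import watershed

def refine_with_watershed(cnn_mask: np.ndarray, wall_class: int = 1) -> np.ndarray:
    """Split the CNN output into separate chamber regions (2D sketch).

    cnn_mask is a 2D integer label map produced by the network; wall_class is
    the (assumed) label covering the heart wall and background.
    """
    # Step 1: isolate candidate chamber pixels (everything that is not wall).
    chambers = cnn_mask != wall_class

    # Step 2: dilation and closing to suppress speckle noise and small holes
    # inside the chamber regions.
    chambers = binary_closing(binary_dilation(chambers, disk(2)), disk(3))

    # Step 3: distance transform -- pixels far from the wall become peaks.
    distance = ndi.distance_transform_edt(chambers)

    # Step 4: markers from the strongest peaks, then watershed to split the
    # connected chamber area into separate basins (e.g., LV vs. LA).
    markers = label(distance > 0.5 * distance.max())
    return watershed(-distance, markers, mask=chambers)
```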

References

  1. Tromp, J.; Bauer, D.; Claggett, B.L.; Frost, M.; Iversen, M.B.; Prasad, N.; Petrie, M.C.; Larson, M.G.; Ezekowitz, J.A.; Scott, D.; et al. A formal validation of a deep learning-based automated workflow for the interpretation of the echocardiogram. Nat. Commun. 2022, 13, 6776. [Google Scholar] [CrossRef] [PubMed]
  2. Zhang, J.; Gajjala, S.; Agrawal, P.; Tison, G.H.; Hallock, L.A.; Beussink-Nelson, L.; Lassen, M.H.; Fan, E.; Aras, M.A.; Jordan, C.; et al. Fully automated echocardiogram interpretation in clinical practice: Feasibility and diagnostic accuracy. Circulation 2018, 138, 1623–1635. [Google Scholar] [CrossRef] [PubMed]
  3. Cai, L.; Gao, J.; Zhao, D. A review of the application of deep learning in medical image classification and segmentation. Ann. Transl. Med. 2020, 8, 713. [Google Scholar] [CrossRef] [PubMed]
  4. Xia, X.; Kulis, B. W-Net: A deep model for fully unsupervised image segmentation. arXiv 2017, arXiv:1711.08506. [Google Scholar]
  5. Litjens, G.; Ciompi, F.; Wolterink, J.M.; de Vos, B.D.; Leiner, T.; Teuwen, J.; Isgum, I. State-of-the-art deep learning in cardiovascular image analysis. JACC Cardiovasc. Imaging 2019, 12, 1589–1601. [Google Scholar] [CrossRef]
  6. Kim, W.; Kanezaki, A.; Tanaka, M. Unsupervised learning of image segmentation based on differentiable feature clustering. IEEE Trans. Image Process. 2020, 29, 8055–8068. [Google Scholar] [CrossRef]
  7. Ouyang, D.; He, B.; Ghorbani, A.; Yuan, N.; Ebinger, J.; Langlotz, C.P.; Heidenreich, P.A.; Harrington, R.A.; Liang, D.H.; Ashley, E.A.; et al. Video-based AI for beat-to-beat assessment of cardiac function. Nature 2020, 580, 252–256. [Google Scholar] [CrossRef]
  8. Zhou, T.; Canu, S.; Ruan, S. A review: Deep learning for medical image segmentation using multi-modality fusion. Array 2019, 3, 100004. [Google Scholar] [CrossRef]
  9. Chen, C.; Qin, C.; Qiu, H.; Tarroni, G.; Duan, J.; Bai, W.; Rueckert, D. Deep Learning for Cardiac Image Segmentation: A Review. Front. Cardiovasc. Med. 2020, 7, 25. [Google Scholar] [CrossRef]
  10. Lin, Z.; Lei, C.; Yang, L. Modern Image-Guided Surgery: A Narrative Review of Medical Image Processing and Visualization. Sensors 2023, 23, 9872. [Google Scholar] [CrossRef]
  11. Cruz-Aceves, I.; Avina-Cervantes, J.G.; Lopez-Hernandez, J.M.; Garcia-Hernandez, M.G.; Ibarra-Manzano, M.A. Unsupervised Cardiac Image Segmentation via Multiswarm Active Contours with a Shape Prior. Comput. Math. Methods Med. 2013, 2013, 909625. [Google Scholar] [CrossRef] [PubMed]
  12. Fozilov, K.; Colan, J.; Davila, A.; Misawa, K.; Qiu, J.; Hayashi, Y.; Mori, K.; Hasegawa, Y. Endoscope Automation Framework with Hierarchical Control and Interactive Perception for Multi-Tool Tracking in Minimally Invasive Surgery. Sensors 2023, 23, 9865. [Google Scholar] [CrossRef] [PubMed]
  13. Erkmen, H.; Schulze, H.; Wiesmann, T.; Mettin, C.; El-Monajem, A.; Kron, F. Sensing Technologies for Guidance During Needle-Based Interventions. Sustainability 2023, 13, 1224. [Google Scholar] [CrossRef]
  14. Seetharam, K.; Raina, S.; Sengupta, P.P. The Role of Artificial Intelligence in Echocardiography. Curr. Cardiol. Rep. 2020, 22, 99. [Google Scholar] [CrossRef]
  15. Vesal, S.; Gu, M.; Kosti, R.; Maier, A.; Ravikumar, N. Adapt Everywhere: Unsupervised Adaptation of Point-Clouds and Entropy Minimisation for Multi-modal Cardiac Image Segmentation. arXiv 2021, arXiv:2103.08219. [Google Scholar] [CrossRef]
  16. Kalra, A.; Kumar, V.; Chung, E.; Biswas, L.; Kuruvilla, S.; Zoghbi, W.A.; Gilliam, L. Unsupervised Myocardial Segmentation for Cardiac BOLD MRI. J. Cardiovasc. Magn. Reson. 2020, 22, 89–95. [Google Scholar]
  17. Ding, X.; Han, Z. A Semi-Supervised Approach Combining Image and Frequency Enhancement for Echocardiography Segmentation. IEEE Access 2024, 12, 92549–92559. [Google Scholar] [CrossRef]
  18. Abdi, A.H.; Luong, C.; Tsang, T.; Allan, G.; Nouranian, S.; Jue, J.; Hawley, D.; Fleming, S.; Gin, K.; Swift, J.; et al. Automatic Quality Assessment of Echocardiograms Using Convolutional Neural Networks: Feasibility on the Apical Four-Chamber View. IEEE Trans. Med. Imaging 2017, 36, 1221–1230. [Google Scholar] [CrossRef]
  19. Lang, R.M.; Badano, L.P.; Mor-Avi, V.; Afilalo, J.; Armstrong, A.; Ernande, L.; Foster, R.M.; Goldstein, E.A.; Kuznetsova, S.; Lancellotti, L.; et al. Recommendations for cardiac chamber quantification by echocardiography in adults: An update from the American Society of Echocardiography and the European Association of Cardiovascular Imaging. J. Am. Soc. Echocardiogr. 2015, 28, 1–39. [Google Scholar] [CrossRef]
  20. Aganj, I.; Harisinghani, M.G.; Weissleder, R.; Fischl, B. Unsupervised Medical Image Segmentation Based on the Local Center of Mass. Sci. Rep. 2018, 8, 13012. [Google Scholar] [CrossRef]
  21. Yang, J.; Ding, X.; Zheng, Z.; Xu, X.; Li, X. GraphEcho: Graph-Driven Unsupervised Domain Adaptation for Echocardiogram Video Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 4–6 October 2023; pp. 11878–11887. [Google Scholar] [CrossRef]
  22. Ye, K.; Liu, P.; Zou, X.; Zhou, Q.; Zheng, G. KiPA22 Report: U-Net with Contour Regularization for Renal Structures Segmentation. In Proceedings of the KiPA22 Conference, Shanghai Jiao Tong University, Shanghai, China, 8–12 August 2022. [Google Scholar]
  23. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006. [Google Scholar]
  24. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  25. Kornilov, A.; Safonov, I.; Yakimchuk, I. A Review of Watershed Implementations for Segmentation of Volumetric Images. J. Imaging 2022, 8, 127. [Google Scholar] [CrossRef] [PubMed]
  26. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  27. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  28. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
Figure 1. Diagram of the neural network architecture used for unsupervised segmentation. The network consists of an encoder–decoder structure with skip connections, downsampling and upsampling layers, and convolutional operations. The input batch of echocardiogram images is progressively encoded, with the middle bottleneck capturing deep features, followed by decoding to produce both the reconstructed image and semantic segmentation masks. The network is optimized using reconstruction loss, similarity and contour regularization loss.
Figure 2. Score distribution for training, validation, and test datasets.
Figure 3. Training and validation loss over epochs, highlighting the smallest losses for both training and validation at epoch 32. The model was saved at this point, as it achieved the best validation loss of 0.0602.
Figure 4. Reduction in the number of segmentation masks as training progresses, illustrating the refinement in segmentation and the need for additional processing to extract chamber volumes.
Figure 5. Watershed segmentation pipeline. Top row, from left to right: (1) original image, (2) segmented image after the CNN model, (3) pre-processing for the watershed by isolating the background and heart walls. Bottom row, from left to right: (4) after dilation and closing to reduce internal noise in the structures, (5) distance transform applied, (6) final segmentation using the 3D watershed algorithm.
Figure 6. Comparison between the network output and the 3D Watershed segmentation. (a) Network output: the network accurately segments the heart wall but does not isolate the left ventricle chamber, leading to a segmentation that includes both the heart wall and the chamber. (b) Watershed 3D: after applying the Watershed algorithm, the left ventricle chamber is successfully isolated from the surrounding heart wall, providing a clearer and more anatomically accurate segmentation.
Figure 7. Comparison of segmentation results for Systolic (top row) and Diastolic (bottom row) frames: (a) original frame, (b) ground truth mask, (c) our model mask, (d) EchoNet mask.
Figure 8. Good performance: original vs. segmented images for 2 frames of 2 different videos. Good performance images show well-defined boundaries and accurate segmentation.
Figure 9. Bad performance: original vs. segmented images for 2 frames of 2 different videos. Poor performance examples exhibit incorrect boundary detection or incomplete segmentation.
Table 1. Comparison of segmentation accuracy using Dice and IoU between our method (Unsupervised) and EchoNet Dynamic (Supervised), including 95% confidence intervals (CI).
Metric | Method | Value | 95% CI
Dice Coefficient | Our Method (Unsupervised) | 0.8225 | [0.8098, 0.8353]
Dice Coefficient | EchoNet Dynamic (Supervised) | 0.9241 | [0.9207, 0.9275]
IoU | Our Method (Unsupervised) | 0.7071 | [0.6928, 0.7213]
IoU | EchoNet Dynamic (Supervised) | 0.8600 | [0.8542, 0.8657]
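For reference, the two overlap metrics reported in Table 1 can be computed from binary left-ventricle masks as in the short sketch below (not the evaluation code used in the study).

```python
import numpy as np

def dice_and_iou(pred: np.ndarray, truth: np.ndarray) -> tuple:
    """Dice coefficient and IoU for two binary masks of equal shape."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    # max(..., 1) guards against division by zero when both masks are empty.
    dice = 2.0 * intersection / max(pred.sum() + truth.sum(), 1)
    iou = intersection / max(union, 1)
    return float(dice), float(iou)
```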
Table 2. Comparison of Ejection Fraction (EF) calculation accuracy using MAE, MAPE, and RMSE between our method and EchoNet Dynamic.
Metric | Our Method vs. GT | EchoNet Dynamic vs. GT
MAE | 13.7057 | 13.1116
MAPE | 28.01% | 25.38%
RMSE | 16.9375 | 18.5931
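Similarly, the ejection fraction error metrics in Table 2 follow the standard definitions sketched below; the array names are illustrative and EF values are assumed to be expressed in percent.

```python
import numpy as np

def ef_errors(predicted_ef: np.ndarray, reference_ef: np.ndarray) -> dict:
    """MAE, MAPE, and RMSE between predicted and reference EF values."""
    err = predicted_ef - reference_ef
    return {
        "MAE": float(np.mean(np.abs(err))),
        "MAPE": float(np.mean(np.abs(err / reference_ef)) * 100.0),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
    }
```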
