Article

Benchmarking Anomaly Detection Methods for Extracardiac Findings in Cardiac MRI

by Edgar Pinto 1,2,3, Patrícia M. Costa 4, Catarina Silva 5, Vitor H. Pereira 1,2,6, Jaime C. Fonseca 3 and Sandro Queirós 1,2,*
1 Life and Health Sciences Research Institute (ICVS), School of Medicine, University of Minho, 4710-057 Braga, Portugal
2 ICVS/3B’s—PT Government Associate Laboratory, 4710-057 Braga, Portugal
3 Algoritmi Center, School of Engineering, University of Minho, 4800-058 Guimarães, Portugal
4 Department of Radiology, Hospital CUF Viseu, 3500-612 Viseu, Portugal
5 Department of Radiology, Unidade Local de Saúde do Alto Ave, 4835-044 Guimarães, Portugal
6 Cardiology Department, Unidade Local de Saúde de Braga, 4710-243 Braga, Portugal
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(7), 4027; https://doi.org/10.3390/app15074027
Submission received: 27 February 2025 / Revised: 25 March 2025 / Accepted: 2 April 2025 / Published: 6 April 2025
(This article belongs to the Special Issue Advanced Image Analysis and Processing Technologies and Applications)

Abstract: In cardiac magnetic resonance (MR) imaging, an initial set of sequences is acquired to guide the definition of the subsequent cardiac views. These sequences provide a large field of view, enabling the detection of extracardiac findings (ECFs). Although ECFs may have significant clinical relevance, they are typically overlooked since they fall outside the scope of cardiac examinations. The only prior attempt to automatically detect incidental ECFs employed fully supervised methods but faced substantial limitations due to the impracticality of collecting comprehensive samples given the wide range of possible anomalies across various organs. This study investigates the potential of recent anomaly detection (AD) methods to address this challenge. While AD methods have gained popularity, their application has been largely confined to industrial settings or medical imaging tasks such as brain MR or chest X-ray, which exhibit lower anatomical variability and complexity than cardiac MR anatomical sequences. To this end, twenty state-of-the-art (SOTA) AD methods, including unsupervised, semi-supervised, and open-set supervised learning methodologies, are compared against two fully supervised baselines for detecting ECFs in anatomical planes of cardiac MR. Results from our in-house dataset reveal suboptimal performance of SOTA AD methods, highlighting the need for further research in this domain.

1. Introduction

In recent years, magnetic resonance (MR) imaging has gained increasing relevance in cardiology due to its high image resolution, superior contrast, and radiation-free nature [1]. Cardiac magnetic resonance (CMR) is now indispensable in clinical practice [2] and is widely recognized as the gold standard for assessing cardiac chamber volumes, cardiac tissue health, and myocardial viability. A CMR exam begins with the acquisition of localizer images in the axial, coronal, and sagittal planes to align the MR setup with the patient’s position. These initial images then guide the acquisition of multi-slice sequences along the same directions, known as anatomical planes, which provide a comprehensive view of the heart and facilitate the definition of cardiac axes for subsequent cardiac-focused sequences [3]. These anatomical planes capture a broad thoracic region, encompassing multiple organs, and may reveal non-cardiac findings that can aid in diagnosing primary cardiac pathology or guiding appropriate patient management [4,5].
Extracardiac findings (ECFs) are defined as any non-cardiac abnormality incidentally detected in these preliminary sequences, categorized by clinical relevance into major, significant, and insignificant findings. Studies have shown that approximately 36 ECFs are identified per 100 CMR examinations, with nearly half being clinically meaningful [4,6,7,8,9]. Despite their potential significance, these preliminary images are often disregarded as they fall outside the primary scope of cardiac examinations. Moreover, the wide spectrum of potential abnormalities increases the likelihood of overlooking them. Thus, automated detection of ECFs could assist clinicians by highlighting suspected abnormalities, potentially prompting timely referrals to other specialties.
To the best of the authors’ knowledge, the study by Wickremasinghe et al. [5] stands as the only attempt to automate ECF detection in CMR HASTE images. Their work explored two deep-learning-based supervised approaches: binary classification at the image level and multi-label classification at the organ level. However, their results exhibited low sensitivity [5], falling short of acceptable values for clinical application. This shortfall can be attributed to inherent challenges in supervised methods, including their difficulty in handling small labeled datasets and inadequate outlier clustering capabilities.
Conversely, anomaly detection (AD) methods shift the focus to the normal class, which is easier to model/cluster due to its relative uniformity and data abundance. With this methodology, anomalies can be identified simply as deviations [10,11]. These methods have recently found utility in various domains, including brain MR [12,13,14,15,16,17,18,19,20,21,22], thoracic X-ray [18,23,24], and industrial imaging [25,26,27,28,29,30,31,32,33], all of which involve images with lower structural complexity and variance. Since achieving satisfactory performance on such datasets does not guarantee favorable outcomes on datasets characterized by higher variance in normality, our study aims to investigate the effectiveness of recent AD methods in addressing a more complex task—detecting ECFs in CMR images. To achieve this, we conducted a comparative analysis of state-of-the-art unsupervised, semi-supervised, and open-set supervised AD methodologies and compared them with two supervised baselines.
In summary, this work presents the following contributions:
  • Systematic evaluation of AD methods for detecting ECFs in CMR images, considering an extensive set of approaches, including unsupervised, semi-supervised, and open-set supervised methodologies;
  • Comparison of these AD benchmarked methods with two fully supervised baselines, allowing a better perception of the effectiveness of these AD methods;
  • In-depth discussion of the strengths and weaknesses of AD methods in a challenging dataset, highlighting possible future directions.

2. Related Work

Anomaly detection methods can be categorized based on their learning principles into unsupervised, semi-supervised, and open-set supervised approaches (Figure 1) [10]. Unsupervised AD methods, the most prevalent category, use a curated dataset comprising only normal images to model the normal class, detecting anomalies as deviations from this learned distribution without supervision on what constitutes an anomaly [18,23,34]. In contrast, semi-supervised AD methods incorporate normal, abnormal, and unlabeled samples to define a compact cluster representing normality [23,35]. Within this category, one-class semi-supervised AD methods rely exclusively on normal and unlabeled samples [23]. Finally, open-set supervised methods utilize labeled normal and abnormal samples by combining supervised with unsupervised mechanisms. They may use abnormal samples to either serve as negative samples to refine the modeling of the normal class distribution [33] or explicitly guide the model in recognizing abnormal patterns in images [27].
Given the broad range and diversity of the methods addressed in this study, the rest of this section provides a concise overview of the benchmarked AD methods, along with other relevant approaches from the literature. This overview is intended to provide sufficient background to understand the subsequent sections of the manuscript. For a more detailed explanation of the methods used in this work, the readers are referred to the original papers or relevant AD review articles.
According to [18], unsupervised AD methods can be further divided into four categories: image reconstruction, feature modeling, attention-based, and self-supervised methods. Image reconstruction methods typically employ autoencoder (AE) networks to learn the underlying data distribution while enforcing a low-dimensional latent space to prevent identity mappings. Kingma and Welling [36] enhanced the regularization effect on these networks by introducing the variational autoencoder (VAE), modeling a likelihood distribution of the data conditioned on latent variables, which is simpler than explicitly modeling the data distribution. However, VAEs often produce blurry reconstructions, which prompted Chen et al. [14] to develop a restoration framework, known as r-VAE, that selectively refines abnormal regions (i.e., those with low ELBO values) using maximum a posteriori estimation. This approach penalizes deviations from both the input image and a learned normality distribution. To further mitigate blurry reconstructions, generative adversarial networks (GANs) leverage adversarial loss to enhance image realism. In this context, Schlegl et al. proposed AnoGAN [37], which seeks to find the latent representation of the most similar normal and realistic version of the query image. This is achieved by optimizing a loss function that penalizes reconstruction errors and discrepancies in discriminator-based classification, where the discriminator is trained to distinguish normal from abnormal images based on similarity to the training data. Later, the same authors proposed f-AnoGAN [38], a more efficient version that, instead of computing the latent representation by iteratively backpropagating the loss with frozen model parameters, directly maps the image to the latent space through an encoder network. They also adopted a Wasserstein GAN to enhance training stability and mitigate mode collapse. Beyond convolutional neural networks, Ghorbel et al. [15] introduced a hierarchical transformer AE network with skip connections (H-TAE-S), which enables simultaneous learning of local and global features by leveraging the long-range dependencies captured by transformer architectures. More recently, denoising diffusion probabilistic models (DDPMs) and diffusion models in general have emerged as state-of-the-art techniques in image reconstruction. These models operate via a Markovian process of noising and denoising [13,22,39]. In the AD domain, Wyatt et al. proposed AnoDDPM [22], which employs a partial diffusion process to selectively correct regions deviating from the learned distribution rather than generating entirely new samples. Additionally, rather than relying on Gaussian noise, they utilize Simplex noise as it corrupts primarily low-frequency components, and consequently abnormalities, preserving high-frequency structures that act as landmarks during the reverse process. To further enhance reconstruction quality and reduce false positives, Behrendt et al. proposed patched DDPM (pDDPM), where a partial diffusion process is applied to patches rather than to the entire image [13]. Since this approach is computationally demanding, they opted for a one-step prediction framework, leveraging contextual information from surrounding patches to improve anatomical consistency.
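To make the principle shared by these reconstruction-based methods concrete, the sketch below (a simplified, hypothetical PyTorch illustration, not the implementation of any cited work) shows how an autoencoder trained only on normal images is typically used at inference time: the anomaly map is the pixel-wise residual between the query image and its reconstruction, and pooling that map yields an image-level score.

```python
import torch
import torch.nn as nn

class SmallAE(nn.Module):
    """Toy convolutional autoencoder; real methods use deeper networks."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_map(model, image):
    """Pixel-wise residual between the query image and its reconstruction.

    High residuals flag regions the model could not reproduce, i.e.,
    candidate anomalies. `image` is a (1, 1, H, W) tensor in [0, 1].
    """
    model.eval()
    with torch.no_grad():
        reconstruction = model(image)
    return (image - reconstruction).abs()  # L1 residual map

# Usage (model assumed trained on normal images only):
# score_map = anomaly_map(model, query_image)
# image_score = score_map.mean()  # global pooling for an image-level score
```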
Image reconstruction methods often struggle with edges and complex textures. To address these challenges, feature modeling methods have emerged, focusing on detecting anomalies in the feature space rather than directly in the image space [19,31]. These methods can be further categorized into two subgroups: those based on reconstruction errors [19,26,29,31] and those relying on statistical modeling of feature maps [25,28]. Shi et al. [31] introduced the deep feature reconstruction (DFR) method, which extracts hierarchical features from a pre-trained network, aligns and resizes them to match the input size, and applies mean filtering to form a single multi-scaled feature volume. This volume is then processed through an AE with 1 × 1 convolutions. Similarly, Meissen et al. [19] proposed the feature autoencoder (FAE) method, which employs an AE with a larger kernel size and a structural similarity index measure (SSIM) loss to improve robustness compared to simple pixel-wise differences. However, unlike DFR, FAE does not incorporate feature map alignment or mean filtering. Teacher–student models are another way of using feature reconstruction errors to detect anomalies. Deng et al. [26] proposed a reverse distillation (RD) model, utilizing heterogeneous teacher and student networks along with a one-class bottleneck embedding module to enforce the networks to learn distinguishable filters. The anomaly map is computed as the averaged difference between feature maps from corresponding layers of the teacher and student networks. Notably, all these feature modeling methods rely on frozen pre-trained backbones for feature extraction. In contrast, Guo et al. [29] introduced ReContrast, an approach that optimizes the entire network to better align with the target domain. This is achieved through a contrastive learning paradigm, where a shared decoder is paired with both an unfrozen encoder trained on the target domain and a frozen one from the pre-training domain. The loss is computed between the representations of both encoders but is backpropagated only through the target-domain encoder. Departing from these approaches, Defard et al. [25] proposed PaDiM, which models normal features as a multivariate Gaussian distribution and detects abnormal regions based on the Mahalanobis distance between image features and this distribution. Additionally, some methods have sought to model normal data with more complex statistical frameworks. Gudovskiy et al. [28] introduced CFLOW-AD, which applies a sequence of bijective transformations to a simpler distribution using normalizing flows, enabling the generation of more complex data distributions.
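As a concrete illustration of the statistical feature-modeling route, the following sketch outlines a PaDiM-like scoring step under simplifying assumptions (a single feature layer, no dimensionality reduction); the array shapes and regularization constant are illustrative choices rather than the published configuration.

```python
import numpy as np

def fit_gaussian(normal_feats):
    """Fit a per-location multivariate Gaussian to features of normal images.

    normal_feats: (N, C, H, W) embeddings extracted from N normal images.
    Returns per-location means (C, H, W) and inverse covariances (H, W, C, C).
    """
    n, c, h, w = normal_feats.shape
    mean = normal_feats.mean(axis=0)
    inv_cov = np.empty((h, w, c, c))
    for i in range(h):
        for j in range(w):
            x = normal_feats[:, :, i, j]                        # (N, C) samples at (i, j)
            cov = np.cov(x, rowvar=False) + 0.01 * np.eye(c)    # regularized covariance
            inv_cov[i, j] = np.linalg.inv(cov)
    return mean, inv_cov

def mahalanobis_map(feat, mean, inv_cov):
    """Anomaly map: Mahalanobis distance of each spatial feature to the fit."""
    c, h, w = feat.shape
    dist = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            d = feat[:, i, j] - mean[:, i, j]
            dist[i, j] = np.sqrt(d @ inv_cov[i, j] @ d)
    return dist
```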
As an alternative to detecting anomalies in AE models, attention-based methods leverage attention maps. Liu and Li [30] proposed the explanatory variational autoencoder (expVAE), which localizes anomalies by computing gradient-based attention maps under the assumption that the posterior distribution of normal samples matches the prior distribution. Later, Silva-Rodríguez et al. [20] identified a relationship between these gradient-based attention maps and the activation maps generated by the encoder. Building on this insight, they proposed AMCons, which uses these activation maps to detect anomalous features while incorporating an entropy loss to enforce an evenly distributed attention across all normal features.
In turn, self-supervised methods generate synthetic abnormal samples by randomly adding noise [13,16,17,22] or inserting image patches [21,24,32] into normal images, followed by either restoring the original image or segmenting the abnormal region. The denoising autoencoder (DAE) assumes that abnormalities can be simulated by adding noise to the query image, which is subsequently denoised using a U-Net architecture [16,17]. Another approach for generating anomalies involves inserting image patches at random locations, as seen in Poisson image interpolation (PII) [24] and CutPaste [32]. Unlike CutPaste, PII employs Poisson interpolation to seamlessly blend patch edges into the target image.
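To illustrate how such synthetic anomalies can be manufactured, the snippet below generates a CutPaste-style corruption (a deliberately bare-bones sketch: the published method adds further augmentations, and PII instead blends the patch via Poisson interpolation).

```python
import numpy as np

def cutpaste_anomaly(image, rng=None, patch_frac=0.15):
    """Copy a random patch and paste it at a different location.

    image: 2D float array (H, W). Returns the corrupted image and a binary
    mask of the pasted region, usable as a segmentation target.
    """
    rng = rng or np.random.default_rng()
    h, w = image.shape
    ph, pw = int(h * patch_frac), int(w * patch_frac)

    # Source and destination locations are chosen independently at random.
    sy, sx = rng.integers(0, h - ph), rng.integers(0, w - pw)
    dy, dx = rng.integers(0, h - ph), rng.integers(0, w - pw)

    corrupted = image.copy()
    corrupted[dy:dy + ph, dx:dx + pw] = image[sy:sy + ph, sx:sx + pw]

    mask = np.zeros_like(image, dtype=np.uint8)
    mask[dy:dy + ph, dx:dx + pw] = 1
    return corrupted, mask
```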
A representative example of a one-class semi-supervised AD method is the dual-distribution discrepancy for anomaly detection (DDAD) method [23]. This method trains an ensemble of AEs using only normal images, learning a normality distribution that allows reconstruction of normal structures. Simultaneously, a second ensemble of AEs is trained on unlabeled images, learning a broader distribution capable of reconstructing both normal and abnormal features. The intra- and inter-discrepancy between the reconstructions of these two AE ensembles is then leveraged to generate an anomaly map, which is further refined by a second-stage network trained through self-supervision [23].
Finally, DRA and BGAD are two examples of open-set supervised AD methods [27,33]. DRA employs a multi-head architecture with four classification heads, where three are dedicated to detecting abnormality patterns—two using unsupervised techniques and one via supervised learning—while the fourth head, trained with supervision, assigns a normality score to the query image [27]. Conversely, BGAD extracts features using a pre-trained network and utilizes normalizing flows to model the normality distribution. In this method, abnormal samples and corresponding features are used to refine the normality distribution through a boundary optimization loss [33].

3. Materials and Methods

3.1. CMR Dataset

This study utilizes a dataset of CMR HASTE anatomical sequences acquired in the axial, sagittal, and coronal planes during routine clinical practice at Hospital of Braga (HB; Portugal) between 2018 and 2019. Retrospectively collected with ethical approval from HB’s Ethics Committee (ref. 180/2023), the dataset includes 691 cases, totaling 35,071 DICOM images. All images were annotated by one of two radiologists regarding the presence of ECFs. When present, the findings were coarsely segmented by the expert. The images reveal a wide variety of findings with different prevalence (Figure 2), including liver nodules, spleen nodules, vesicular lithiasis, lung nodules, pleural effusions, bronchiectasis, non-specific lung alterations, mediastinal changes, hiatal hernias, breast nodules, aortic aneurysms, bone nodules, ascites, adrenal nodules, breast implants, and structural anomalies. Table 1 summarizes some key dataset statistics, namely the proportion of normal and abnormal samples at the patient, sequence, and image levels.

3.2. Dataset Pre-Processing

Since the CMR dataset comprises multiple anatomical sequences, images were first organized by view. Subsequently, all images were min–max normalized and saved in PNG format, accompanied by their respective foreground and ground truth (GT) masks. The foreground mask was generated by applying a thresholding operation to roughly identify the body, followed by a morphological closing operation to fill inner holes and keeping only the largest connected component. Regarding the GT mask, since classifying the specific type of ECF was not a focus of this study, all segmentation maps were merged into a single binarized mask. Images containing at least one abnormal pixel were labeled as abnormal.
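One possible realization of the foreground-mask step is sketched below with scikit-image and SciPy; the threshold value and structuring-element size are illustrative assumptions rather than the exact settings used in this work.

```python
import numpy as np
from scipy import ndimage
from skimage import measure, morphology

def foreground_mask(image, threshold=0.05):
    """Roughly segment the body: threshold, close holes, keep largest component.

    image: 2D array already min-max normalized to [0, 1].
    """
    mask = image > threshold                              # rough body segmentation
    mask = morphology.binary_closing(mask, morphology.disk(5))
    mask = ndimage.binary_fill_holes(mask)                # fill remaining inner holes
    labels = measure.label(mask)
    if labels.max() == 0:
        return mask
    # Keep only the largest connected component (assumed to be the body).
    largest = np.argmax(np.bincount(labels.ravel())[1:]) + 1
    return labels == largest
```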
Since AD methods are often designed for datasets with low-resolution images, such as MvTec-AD [25,26,27,29,30,31,33] or MNIST [30], an image size of 128 × 128 pixels was adopted to streamline architecture and hyperparameter tuning. This resolution also improves computational efficiency and ensures compatibility across all benchmarked methods while preserving most image details, including small abnormalities. Most images originally had a resolution of 256 × 256 pixels and were resized using Lanczos interpolation, followed by center cropping when necessary. Foreground and GT masks were resized using nearest-neighbor interpolation.
To ensure robust evaluation, each view subdataset was randomly split five times into training, validation, and test subsets in a patient-disjoint manner. Since the benchmark includes supervised, semi-supervised, and unsupervised methods, both normal and abnormal images were included in all subsets. Balanced evaluation subsets were created by sampling an equal number of normal and abnormal images for both validation and test subsets (100 and 500 each in the validation and test subsets, respectively). The remaining available images from unsampled patients were assigned to the training subset. Additionally, the distribution of samples across subsets was qualitatively assessed, and the sampling process repeated until similar abnormality class prevalence was achieved across subsets. Importantly, normal images from abnormal sequences were excluded from all subsets to prevent the unintentional inclusion of subtle abnormal patterns in otherwise visually normal images. Consequently, normal images were only sampled from sequences entirely free of abnormalities.
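The patient-disjoint splitting can be expressed as a grouped split; the sketch below uses scikit-learn's GroupShuffleSplit as one possible (hypothetical) realization, omitting the balanced sub-sampling of the validation and test subsets and the qualitative prevalence check for brevity.

```python
from sklearn.model_selection import GroupShuffleSplit

def patient_disjoint_split(image_ids, patient_ids, test_frac=0.2, seed=0):
    """Split image indices so that no patient appears in both subsets."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_frac, random_state=seed)
    train_idx, test_idx = next(splitter.split(image_ids, groups=patient_ids))
    return train_idx, test_idx

# Usage: repeat with different seeds to obtain the five dataset versions, then
# sub-sample equal numbers of normal and abnormal images for evaluation.
```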

3.3. Benchmark Methods

Due to the anisotropic nature of the sequences—characterized by a significantly larger interslice distance compared to in-plane resolution and a small number of slices per sequence—and the limited number of sequence samples overall, only 2D anomaly detection methods were explored. This work builds upon the unsupervised AD benchmark conducted by Lagogiannis et al. [18], including image reconstruction (VAE [36], r-VAE [14], f-AnoGAN [38], and H-TAE-S [15]), feature modeling (DFR [31], FAE [19], RD [26], CFLOW-AD [28], and PaDiM [25]), attention-based (expVAE [30] and AMCons [20]), and self-supervised approaches (DAE [16], CutPaste [32], and PII [24]). Additionally, further image reconstruction (AE [23] and OS-DDPM [13]) and feature modeling (ReContrast [29]) methods were included. The OS-DDPM, which stands for One-Step DDPM, is a variant of pDDPM [13] that processes the entire image instead of using a patch-based approach. This approach, resembling AnoDDPM [22], predicts the denoised image in a single inference step. The benchmark was further expanded to incorporate a one-class semi-supervised method (DDAD [23]) and two open-set supervised AD methods (DRA [27] and BGAD [33]). To assess the effectiveness of these AD strategies, all methods were compared against two fully supervised baselines: (1) a binary image classifier, henceforth named SupIC, derived from the DRA method [27] and employing a ResNet-18 backbone with two fully connected layers (of dimensions 256 and 1, respectively); (2) a U-Net binary segmentation network [40], henceforth named SupIS, derived from the DAE method [16] and featuring four sets of convolutional layers, a balanced batch sampler (2/3 normal and 1/3 abnormal samples), a cosine decay learning rate scheduler [41], and a loss function combining cross-entropy and soft dice loss. The selection of these networks allows a fair comparison between AD and fully supervised methodologies.

3.4. Models’ Architecture and Tuning

Among the methods described in [18], only the architecture of expVAE [30] was modified. Originally designed for the MNIST dataset, where images are only 28 × 28 pixels, the network consisted of just two encoding and decoding layers. However, due to the higher resolution of CMR HASTE images, additional layers were introduced to extract more meaningful features. To enable a fair comparison, not only between the two attention-based methods but also with the image reconstruction approaches, the architecture of expVAE was aligned with that of VAE. The architectures of all other methods remained unchanged from their respective original implementations.
Since the CMR dataset used in this work differs in terms of imaging modality and/or anatomical region from those utilized in [13,14,15,16,18,19,20,23,24,25,26,27,28,29,30,31,32,33,38], hyperparameter tuning was performed. First, the number of training steps/epochs was adjusted to ensure proper convergence in our dataset. Method-specific parameters were then tested. For the methods included in [18], we tuned the hyperparameters they had previously optimized, as well as those differing from the respective original work. For the remaining methods, we tuned the hyperparameters identified as important by their respective authors. The hyperparameter combination yielding the best average performance across the five validation subsets was selected to ensure an unbiased evaluation on the test subsets. Table A1 details the hyperparameters modified based on validation set performance, while those that remained unchanged from their respective code bases were omitted for brevity.

3.5. Post-Processing

Reconstruction-based methods typically generate anomaly maps by computing the Manhattan or Euclidean pixel-wise distance between an image and its reconstructed version. However, some studies suggest employing more robust dissimilarity metrics, such as cosine similarity, widely used to compute differences at the feature level [26,29], or SSIM. Unlike pixel-wise intensity comparisons, SSIM evaluates luminance, contrast, and structural differences at the patch level, offering a more perceptually meaningful assessment. In this study, SSIM was explored as an alternative to residual distance in various methods, including VAE, r-VAE, f-anoGAN, H-TAE-S, OS-DDPM, DFR, and DAE. As in the hyperparameter tuning process, the approach (SSIM or the original residual function) that achieved the best validation performance was selected for final test evaluation (see Table A1).
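As an example of the SSIM-based alternative, the snippet below derives a per-pixel dissimilarity map with scikit-image; the window size and data range are illustrative assumptions.

```python
from skimage.metrics import structural_similarity

def ssim_anomaly_map(image, reconstruction, win_size=9):
    """Per-pixel anomaly scores as 1 - local SSIM between image and reconstruction."""
    _, ssim_map = structural_similarity(
        image, reconstruction, win_size=win_size, data_range=1.0, full=True
    )
    return 1.0 - ssim_map  # high where luminance, contrast, or structure differ
```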
To refine the anomaly maps, the computed scores were multiplied by the foreground mask, effectively removing artifacts outside the body region. For image-level classification, global average pooling was applied. Note that this does not apply to the DRA and SupIC methods, which directly predict an anomaly score for the image.
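Both post-processing steps reduce to simple array operations, as sketched below (variable names are hypothetical).

```python
def postprocess(anomaly_map, foreground_mask):
    """Suppress scores outside the body and pool them into an image-level score."""
    refined = anomaly_map * foreground_mask   # remove artifacts outside the body
    image_score = refined.mean()              # global average pooling
    return refined, image_score
```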

3.6. Metrics

Among the most commonly used metrics in the AD field, average precision (AP) was selected to assess the pixel-level classification. AP is defined as the area under the precision–recall curve [42], where higher values indicate better separability between classes and, consequently, improved performance. As a multi-threshold metric, pixel-wise AP (pixel-AP) effectively accounts for abnormalities of varying sizes by ignoring true negatives and emphasizing the rare class (anomalies). In contrast, Dice and intersection over union metrics, which calculate the overlap between the predicted map and the GT map, were not considered due to their threshold dependency and sensitivity to high variability in anomaly size [42]. Given the uncertainty and coarse nature of the GT segmentation maps, as well as the disproportionate size of structural anomalies—often covering most of the image—images containing structural anomalies were excluded from pixel-level performance evaluation and were instead considered only in the image-level assessment described below.
For image-level classification, AP was again used. Since both validation and test subsets contain an equal number of positive and negative samples, the area under the receiver operating characteristic curve (AUROC) was also employed to enhance the interpretability of the results. As with AP, higher values of AUROC indicate better class separability and, consequently, improved performance. These metrics, henceforth referred to as sample-wise AP (sample-AP) and sample-wise AUROC (sample-AUROC), are multi-threshold evaluation metrics, making them well suited to provide a comprehensive and reliable assessment [18,42].
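Both the pixel-level and sample-level metrics can be computed with scikit-learn, as in the minimal sketch below (input arrays are assumed to be collected over the evaluation subset).

```python
from sklearn.metrics import average_precision_score, roc_auc_score

def pixel_level_ap(gt_masks, anomaly_maps):
    """Pixel-wise AP: flatten ground-truth masks and anomaly maps across images."""
    return average_precision_score(gt_masks.ravel(), anomaly_maps.ravel())

def sample_level_metrics(image_labels, image_scores):
    """Image-level AP and AUROC from one anomaly score per image."""
    return (average_precision_score(image_labels, image_scores),
            roc_auc_score(image_labels, image_scores))
```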

3.7. Statistical Analysis

To aid model assessment and identify significant differences between methods, the Iman–Davenport nonparametric test was performed [43,44]. A post hoc Finner test [45] was then conducted for pairwise assessments. For the sake of space, reported results focus only on comparisons between each state-of-the-art AD method and fully supervised baselines. All tests were performed with a significance level of 0.05.
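The Iman–Davenport test is a correction of the Friedman statistic; one possible implementation on top of SciPy is sketched below (the post hoc Finner procedure is omitted).

```python
from scipy import stats

def iman_davenport(*method_scores):
    """Iman-Davenport test over k methods evaluated on the same N subsets.

    Each argument is a sequence of N scores for one method. Returns the
    F statistic and its p-value (F distribution with k-1 and (k-1)(N-1) dof).
    """
    k = len(method_scores)
    n = len(method_scores[0])
    chi2, _ = stats.friedmanchisquare(*method_scores)   # Friedman chi-square
    ff = (n - 1) * chi2 / (n * (k - 1) - chi2)          # Iman-Davenport correction
    p_value = stats.f.sf(ff, k - 1, (k - 1) * (n - 1))
    return ff, p_value
```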

4. Results and Discussion

The test set results for the coronal, sagittal, and axial subdatasets are summarized in Table 2, representing the average across the five subdataset versions. To improve metric interpretability, the expected performance of a random classifier is also included. For a comprehensive qualitative evaluation, three sample images from each subdataset were selected based on the presence of anomalies varying in size, shape, and contrast, as well as their location across different regions of the chest. These images, which also illustrate the inherent variability of normal anatomy, are shown in Figure 3 and will be referenced throughout this section.
In general, all AD methods—except for DAE [16] and OS-DDPM [13] in the coronal and sagittal subdatasets, respectively—achieved lower image-level classification performance compared to the supervised baselines. However, the performance of DAE [16] and OS-DDPM [13] did not exhibit statistically significant differences from either supervised baseline. Notably, several methods, including r-VAE [14], FAE [19], RD [26], ReContrast [29], PII [24], DAE [16], and BGAD [33], achieved lower average image-level results but showed no statistically significant differences from the supervised baselines for at least one subdataset. At the pixel level, OS-DDPM [13], FAE [19], ReContrast [29], PII [24], and BGAD [33] also demonstrated no statistically different performance from the baselines.
Importantly, the performance of the supervised baselines is strongly influenced by the proportion of abnormal images available for training. When the number of abnormal samples was reduced to 50%, 25%, and 10% of those available in the training subset, their performance decreased significantly, particularly for the SupIC method (Table 2). Under these conditions, several top-performing unsupervised AD methods surpassed the baselines. While these results highlight the dependence of supervised methods on abnormal training data, they also suggest that, in this complex dataset, supervised learning of abnormal patterns remains generally more effective than current state-of-the-art AD approaches. Despite this, their performance remains insufficient for clinical application. Indeed, after selecting the optimal threshold on the validation subset by maximizing the difference between the true positive and false positive rate (Youden’s index) and applying it to the test subset, the SupIC and SupIS methods achieved averaged sample-level F1-scores of 64.75% and 59.01%, respectively, across all subdatasets. In terms of sample-level specificity and recall, SupIC attained 78.36% and 59.19%, while SupIS achieved 86.63% and 48.93%, respectively. These results are consistent with the findings of the only previous study in this clinical application [5].
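For reference, the operating point underlying these F1-score, specificity, and recall figures can be obtained as sketched below, selecting the threshold that maximizes Youden's index on the validation scores and then applying it unchanged to the test scores.

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(val_labels, val_scores):
    """Pick the threshold maximizing TPR - FPR (Youden's J) on validation data."""
    fpr, tpr, thresholds = roc_curve(val_labels, val_scores)
    return thresholds[np.argmax(tpr - fpr)]
```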
A comparison of the methods’ performance across the three subdatasets reveals that, in general, the methods perform best on the axial subdataset and comparatively worse on the sagittal subdataset. This discrepancy may be attributed to structural complexity differences across views. Unlike coronal and sagittal slices, axial slices capture a narrower range of anatomical structures, making it easier to learn normal patterns. Notably, methods incorporating some form of supervision exhibit even greater improvement in the axial subdataset. This advantage may stem from the more limited spectrum of abnormalities in axial slices, where lungs are the most prominently represented structure.
The following subsections provide a detailed analysis of the results by category.

4.1. Unsupervised Image Reconstruction Methods

Except for OS-DDPM [13], image reconstruction methods generally struggle to detect anomalies at the pixel level and, consequently, at the sample level. As shown in Figure 4, both AE [23] and VAE [36] produce blurred reconstructions due to the dimensionality reduction applied to the feature space. While this reduction prevents the network from learning identity mappings or approximations thereof, it also compromises the quality of image representation learning. In addition, in the bottleneck layer, features are flattened into a 1D vector, eliminating spatial correlations. The loss function further exacerbates this blurring effect by not explicitly constraining the model to produce realistic outputs. These architectural constraints compel the model to focus on learning, within its capacity, latent representations that can be reliably decoded, resulting in the methods learning mean representations that balance accuracy while avoiding significant errors. Consequently, small lesions often disappear (sample S.I in Figure 5), whereas larger lesions induce distinct features unseen by the network and lead to reconstruction errors reflected in the anomaly maps (sample C.III in Figure 5). Curiously, VAE appears to generate slightly more blurred reconstructions than AE, which can be attributed to the Kullback–Leibler (KL) divergence regularization loss term and the stochastic nature of the bottleneck. These factors increase uncertainty during image generation, which may explain the lower pixel-AP values (−1.89% on average across all views) observed with the VAE method (Table 2). However, training with a stochastic bottleneck may enhance robustness to hypothetical outliers, improving the model’s ability to revert abnormal regions, as observed in the reconstruction of sample C.II (Figure 4). Despite this, the blurred reconstructions tend to emphasize high-frequency regions, making it difficult to detect abnormalities, as evidenced in all AE and VAE anomaly maps in Figure 5. Logically, this effect is less pronounced in low-frequency regions, such as the lungs, where the reconstruction task is less complex, leading to fewer false positives. Since most axial images predominantly capture the lungs, this explains the better performance of AE and VAE at both pixel and image levels in this view (Table 2).
The restoration framework applied to VAE [14] was designed to enhance the outcome quality by iteratively modifying only the regions of the query image exhibiting a low likelihood of normality. As shown in Figure 4, this approach preserves normal regions more effectively, resulting in higher-quality reconstructions compared to AE and VAE. Additionally, Figure 4 also highlights the method’s enhanced ability to correct abnormal regions. This combination of preserving normal areas and accurately addressing abnormalities led to significant gains in pixel-level classification performance compared to VAE, as reflected in Table 2 and Figure 5 (namely, in sample C.III). Overall, the anomaly maps reveal a reduction in false positives, particularly in high-frequency structures. However, the performance of r-VAE remains highly dependent on achieving an appropriate balance between gradient contributions of the reconstruction and normative terms—a process that is largely empirical. Furthermore, careful tuning of the restoration learning rate and the number of reconstruction steps is essential to prevent the generation of overly blurred or minimally altered restored images.
Unlike previous methods, f-AnoGAN [38] generates reconstructions with finer detail and greater realism, as illustrated in Figure 4. Indeed, it is evident that all abnormalities were corrected and replaced by normal tissue. However, normal regions were also modified, resulting in false positive areas on the anomaly maps, without improving the pixel and image-level results compared to the previously discussed AE-based methods (Table 2). Similar to AE-based methods, f-AnoGAN constrains the information available for decoding in its latent space. As a result, while the global structure of the image is preserved, fine details are altered, even in normal regions. For instance, in the reconstruction of sample S.II, lung lesions were successfully removed. However, healthy parts of the heart and abdominal structures were also altered, reducing the method’s ability to highlight the lesion in the corresponding anomaly map (Figure 5).
In contrast, H-TAE-S [15] tends to reconstruct query images too accurately, as shown in Figure 4. Consequently, anomaly scores remain low across nearly the entire image (Figure 5), leading to poor performance at both pixel and image levels. This is likely due to the skip connections in its transformer AE architecture. Skip connections are generally avoided in image reconstruction AD methods to ensure that meaningful latent representations are learned. In this case, however, their placement allows the decoder to access identity mappings directly from the upper layers, reducing the need to extract relevant features in the deeper layers. A similar performance degradation with this method was also observed by Lagogiannis et al. [18], further supporting this hypothesis.
In traditional image reconstruction methods, the representation of normality often approximates a non-injective function of the data, meaning that different normal input images with similar features may result in highly similar or even identical reconstructions. This limitation often leads to false positives. Diffusion models have significantly mitigated this issue. As observed in Figure 4, diffusion-based reconstructions avoid the mode collapse seen in GANs and produce finer details compared to AE-based approaches. This capability translates into high-quality anomaly maps, as evidenced by samples C.II, C.III, S.I, and S.III in Figure 5. Consequently, diffusion models largely outperform other image reconstruction methods (Table 2). However, OS-DDPM [13] struggles with restoring high-frequency abnormalities (e.g., samples S.I and S.II in Figure 4) due to its use of simplex noise, which primarily corrupts low-frequency signals in the image. As a result, high-frequency abnormal structures often remain uncorrupted and are easily reconstructed, leading to undetected anomalies. To address this limitation, it would be valuable to explore alternative types of noise to achieve a better balance between corrupting low- and high-frequency components.

4.2. Unsupervised Feature Modeling Methods

The results in Table 2 corroborate a previous observation in the literature [18] that AD strategies applied to image features, rather than raw pixel data, are generally more effective. This is likely because feature-based approaches provide a clearer mechanism for learning to distinguish normal and abnormal patterns. Both DFR [31] and FAE [19] are AE-based methods that utilize feature volumes as input instead of raw pixels. Notably, the anomaly maps generated by FAE exhibit greater discriminability than those produced by DFR, as observed in samples C.II, C.III, S.II, and A.I (Figure 6). Two primary factors may explain this difference: (1) the mean filter applied to the hierarchical feature volume before it is fed into the AE network; (2) the use of 1 × 1 convolutions in the AE. According to the authors of DFR [31], the mean filter smooths large transitions across features, improving robustness to noise. However, this process may also obscure small abnormal features within the feature volume. In turn, the 1 × 1 convolutions might limit the network’s ability to learn the broader representation of the query features, reducing its discriminative capacity. These factors suggest that FAE’s architecture is better suited for capturing and preserving the subtle details necessary to distinguish abnormalities, resulting in superior anomaly maps and classification performance (+5.06%, +5.10%, and +5.00%, respectively, in pixel-AP, sample-AUROC, and sample-AP across the three views).
RD [26] and ReContrast [29] are also feature reconstruction-based methods and have achieved interesting pixel-level and image-level classification results compared to other unsupervised approaches (Table 2). Unlike DFR and FAE, these methods compute residuals between shallow feature maps using a teacher–student network. This strategy produces anomaly maps with finer details, but it also makes them more susceptible to small noisy reconstruction errors. While such sensitivity can be beneficial for localizing small abnormalities, it complicates the accurate delineation of larger lesions, as seen in the anomaly map for sample C.III in Figure 6. The modifications introduced by Guo et al. [29] resulted in improved pixel-level performance (+4.49% across all views), demonstrating that proper backbone fine-tuning can enhance feature discriminability for modeling normality. The positive-pair contrastive learning strategy, combined with the stop-gradient operation, also proved effective in boosting the performance of a vanilla teacher–student network. Notably, the anomaly maps produced by ReContrast appear more blurred compared to those from RD, likely due to the Gaussian filter applied during post-processing.
Compared to the aforementioned feature modeling methods, PaDiM [25] and CFLOW-AD [28] exhibit inferior performance (Table 2). Surprisingly, PaDiM outperforms CFLOW-AD, as noted in Figure 6 and Table 2 (+1.51%, +7.17%, and +6.54%, respectively, in pixel-AP, sample-AUROC, and sample-AP across the three views). Given the high anatomical variability in our dataset, one would expect the simple Gaussian distribution employed by PaDiM to be insufficient to effectively model the distribution of the extracted features. Instead, methods like CFLOW-AD, which learn more complex distributions from the data, were anticipated to yield superior results. However, this expectation was not met in practice.

4.3. Unsupervised Attention-Based Methods

Regarding attention-based methods, their low accuracy in both pixel-level and image-level classification tasks is evident in Table 2 and further supported by their anomaly maps (Figure 7). In the case of expVAE [30], the generated anomaly maps do not correlate with potential abnormality zones. Instead, they tend to highlight certain anatomical structures without a clear connection to actual anomalies. As described in Section 2, this method is based on the VAE architecture, where the KL divergence term in the loss function is expected to be minimized to ensure that the posterior matches the prior distribution. In theory, this minimization should result in latent embeddings for normal regions having values close to zero, with abnormalities appearing as deviations from this value (i.e., higher absolute values). In this setup, the KL divergence loss term serves as a form of regularization. However, in practice, minimizing the reconstruction error led to an increase in the KL divergence term, suggesting that more representative and stochastically independent latent embeddings were required to capture the wide anatomical variations present in normal samples. Consequently, higher latent embedding values may not necessarily indicate abnormalities but rather reflect the natural diversity of anatomical structures. A similar performance degradation was reported by Lagogiannis et al. [18], particularly on the CheXpert dataset, where expVAE results were comparable to those of a random classifier.
Following the same principles, AMCons [20] also failed to perform effectively, showing a tendency to highlight hyper-intense regions rather than true abnormalities. Moreover, the entropy regularization term introduced to constrain the activation values of normal feature maps may have inadvertently degraded reconstruction quality, causing oscillations between improving reconstruction fidelity and increasing entropy in activation maps. As a consequence, it performs poorly in datasets where anomalies are not characterized by high pixel intensity, as observed in this study and previously noted in [18].

4.4. Unsupervised Self-Supervised Methods

Similar to feature modeling approaches, self-supervised methods yielded competitive results within the unsupervised category, with DAE [16] emerging as one of the best-performing methods overall (Table 2). Examining the corresponding anomaly maps in Figure 7, it is evident that DAE achieves impressive pixel-level results, particularly when lesions resemble noise (e.g., sample C.III). Likewise, PII [24] demonstrated strong pixel-level performance, as illustrated in Figure 7 and reported in Table 2. Since PII is explicitly trained to segment (synthetic) abnormalities, the binary cross-entropy loss helps achieve good specificity at the pixel level. However, because PII is trained with artificially generated squared-shaped abnormalities, it tends to overfit to this shape, often producing square-like patterns in anomaly maps. This limitation is not observed in DAE, as its network is trained to reconstruct a noise-free version of the input normal image, thereby implicitly learning the normal distribution. Nonetheless, DAE appears to slightly overfit to the Gaussian noise distribution added during training, limiting its generalizability to certain anomalies.
Conversely, CutPaste [32] exhibited weaker results, particularly in the coronal and sagittal subdatasets (Table 2). This performance is reflected in its anomaly maps (Figure 7), which show considerable instability, potentially linked to the synthetic abnormalities generated during training. Unlike PII, which applies Poisson interpolation to better blend extracted patches into target images, CutPaste does not apply any fusion technique. As a result, the network can more easily distinguish the pasted patches, increasing the risk of overfitting. Moreover, the high anatomical variability in this dataset appears to demand more complex statistical distributions, which could be better modeled using kernel density estimation if sufficient data were available. These challenges contribute to CutPaste’s inferior pixel-level and image-level classification performance compared to the other two self-supervised methods.

4.5. One-Class Semi-Supervised Methods

Despite an unexpected decrease in performance on the axial view (Table 2), the DDAD method [23] showed overall improvements over the AE method by leveraging abnormal samples to reduce false positives associated with the network’s limited reconstruction capabilities. As shown in Figure 8, most false positives related to high-frequency regions were mitigated. However, these performance gains were not enough to make DDAD competitive with other methods. Indeed, the relative improvement over AE was smaller than that reported by the authors in [23]. Three factors may have contributed to this discrepancy: (1) the high variability of anomalies in our dataset, which exceeds that of their datasets; (2) abnormalities exhibit a wider range of possible locations, reducing the anomaly rate in any given image region; (3) just as AEs struggle to model the normal data distribution, they also face challenges in modeling the abnormal data distribution. These factors weaken the relationship between normal and abnormal reconstructions, hampering the effectiveness of the subsequent refinement of the extracted anomaly maps.

4.6. Open-Set/Weakly Supervised Methods

The BGAD method [33] achieved strong pixel-level performance (Table 2), successfully localizing small abnormalities, such as in sample C.I, and demonstrating robust localization in samples C.III, S.II, A.I, and A.III (Figure 8). This method is comparable to CFLOW-AD, as both networks share a similar pipeline, highlighting the significant performance boost provided by the contrastive supervision employed in BGAD. This remarkable improvement in performance can also be attributed to the two-stage training framework. In the first stage, the normal distribution is learned without contamination from abnormal samples, significantly reducing the risk of overfitting. In the second stage, the model’s discriminative capabilities are enhanced by incorporating examples of abnormalities.
In turn, DRA emerged as one of the best-performing image-level classifiers in our benchmark study (Table 2), demonstrating notable effectiveness despite its simplicity. As seen in the SupIC baseline, DRA’s supervised network head plays a critical role in achieving this performance. One advantage of DRA over SupIC is that the reduction of abnormal training data does not cause as sharp a decline in performance, thanks to the embedded unsupervised mechanisms. However, when only a small number of abnormal training samples are available, DRA’s performance lags behind that of several unsupervised AD methods. This limitation primarily arises from the negative contribution of the supervised head to the abnormality score, which can overturn favorable decisions made by the remaining heads of the model. Additionally, a drawback of DRA is its inability to directly produce pixel-level predictions, restricting its application to identifying abnormal samples without the capability to localize anomalies within the images.

4.7. Fully Supervised Methods

As expected, both SupIC and SupIS are dependent on the amount of abnormal data available during training, though SupIS is less affected by reductions in the amount of abnormal samples. Looking at the anomaly maps in Figure 8, it is evident that SupIS exhibits high specificity, resulting in a low number of false positives but also a limited number of true positives.

4.8. General Challenges and Limitations

The benchmarked methods exhibit several limitations, as evidenced by the results. One key challenge is their difficulty in modeling and recognizing less representative normal structures within the dataset. A specific example is sample C.I, which contains a pericardial effusion—an abnormality not labeled in this study, as it falls within the cardiac domain (and therefore not an ECF). Although samples with pericardial effusion are included in the normal training set, they are underrepresented. As the methods predominantly learn the average appearance of normal images, this underrepresentation can lead to the misclassification of images with pericardial effusion.
Furthermore, all methods treat individual slices as independent samples, failing to account for the spatial continuity of the anatomical structures they capture. While this limitation is expected, given that they were originally designed for 2D image-based tasks, the lack of 3D awareness likely contributed to their underperformance on this dataset. An example of this limitation is seen in sample A.II, which captures the transition between the liver and the right lung. Without integrating 3D positional information, the models may incorrectly identify this transition as an abnormality in the right lung. As shown in Figure 4, most image reconstruction methods removed the liver entirely, except for H-TAE-S and DAE, which apply minimal modifications to all images. Curiously, other types of methods do not exhibit this behavior. This example highlights the importance of incorporating 3D spatial awareness into future approaches to improve performance. This could involve using both anterior and posterior adjacent slices as input or conditioning the model with additional information about the relative spatial location of the input slice in the body.

5. Conclusions

This study presents a comprehensive comparative analysis of unsupervised, semi-supervised, and open-set supervised anomaly detection methods for identifying ECFs in CMR HASTE images. The results reveal that while some state-of-the-art AD methods achieved performance comparable to fully supervised baselines on certain metrics, the latter remained the most reliable option for this challenging dataset. The suboptimal performance of AD methods may be attributed to the high variability and structural complexity of these images. Methods incorporating abnormality supervision, which prioritize learning abnormal patterns over normal variability, are less affected by these challenges but still struggle to generalize to rare or unseen anomalies.
Among the tested approaches, OS-DDPM, DAE, BGAD, DRA, SupIC, and SupIS demonstrated the highest effectiveness in detecting ECFs. However, the performance of supervised methods remains highly dependent on the availability of abnormal training samples. In scenarios with limited abnormal data, unsupervised AD strategies exhibited stronger relative performance. The plateaued performance of fully supervised methods, coupled with the promising results of certain unsupervised AD approaches, underscores the potential for further exploration and refinement of AD strategies in this clinical task.

Author Contributions

Conceptualization, S.Q.; methodology, E.P. and S.Q.; software, E.P.; validation, J.C.F. and S.Q.; formal analysis, E.P. and S.Q.; investigation, E.P. and S.Q.; resources, V.H.P., J.C.F. and S.Q.; data curation, P.M.C., C.S. and V.H.P.; writing—original draft preparation, E.P. and S.Q.; writing—review and editing, E.P., P.M.C., C.S., V.H.P., J.C.F. and S.Q.; supervision, J.C.F. and S.Q.; project administration, S.Q.; funding acquisition, S.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Portuguese National funds, through the Foundation for Science and Technology (FCT)—projects UIDB/50026/2020 (DOI: 10.54499/UIDB/50026/2020), UIDP/50026/2020 (DOI: 10.54499/UIDP/50026/2020), LA/P/0050/2020 (DOI: 10.54499/LA/P/0050/2020), and PTDC/EMD-EMD/1140/2020 (DOI: 10.54499/PTDC/EMD-EMD/1140/2020), and grant CEECIND/03064/2018 (S.Q.; DOI: 10.54499/CEECIND/03064/2018/CP1581/CT0017), and by the project NORTE-01-0145-FEDER-000039, supported by Norte Portugal Regional Operational Programme (NORTE 2020), under the Portugal 2020 Partnership Agreement, through the European Regional Development Fund (ERDF).

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Ethics Committee of Hospital of Braga (Braga, Portugal; ref. 180/2023). This is a retrospective study with no intervention. No clinical information was extracted and the images were exported in a pseudoanonymized format without direct identifiers.

Informed Consent Statement

Patient consent was waived due to the retrospective nature of the study and the use of pseudoanonymized imaging data.

Data Availability Statement

The dataset used in this study is not publicly available due to restrictions imposed by the Ethics Committee. Requests for access can be directed to the corresponding author and will be considered on a case-by-case basis.

Acknowledgments

The authors would like to acknowledge the Clinical Academic Center-Braga (2CA-Braga) for its support in submitting the project to the ethics committee of the Hospital of Braga (HB) and facilitating the retrospective collection of image data from HB. Additionally, the authors acknowledge the donation of an RTX A6000 GPU by NVIDIA Corporation (USA).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Hyperparameters changed in our benchmark compared to the original settings.
Category | Method | Hyperparameters
U/IR | AE [23] | Latent dimension = 32
U/IR | VAE [36] | Latent dimension = 128; # steps = 20,000; SSIM = False
U/IR | r-VAE [14] | Latent dimension = 128; # steps = 20,000; Restoration LR = 5000; # restoration steps = 20,000; SSIM = True
U/IR | f-AnoGAN [38] | SSIM = True
U/IR | H-TAE-S [15] | # steps = 50,000; LR = 1 × 10−5; β1 = 0.9; β2 = 0.999
U/IR | OS-DDPM [13] | SSIM = True; SSIM kernel σ = 1; SSIM kernel size = 9; # test timesteps = 200
U/FM | DFR [31] | # steps = 40,000
U/FM | FAE [19] | -
U/FM | RD [26] | # steps = 30,000
U/FM | ReContrast [29] | # epochs = 30; LR2 = 1 × 10−6
U/FM | PaDiM [25] | -
U/FM | CFLOW-AD [28] | Backbone architecture = ResNet-18; LR scheduler = True; # steps = 12,000
U/AB | expVAE [30] | Latent dimension = 128; # steps = 20,000; Target layer = 1
U/AB | AMCons [20] | # steps = 150,000; LR = 1 × 10−5; Level CAMs = 3; Latent dimension = 128
U/S-S | DAE [16] | # steps = 35,200; SSIM = False (coronal subdataset); SSIM = True (sagittal and axial subdatasets)
U/S-S | CutPaste [32] | -
U/S-S | PII [24] | # steps = 3000
SS/OC | DDAD [23] | AE network = MemAE
WS | BGAD [33] | Data strategy = {0,1}; Meta epochs = 180
WS | DRA [27] | Batch size = 5; # steps/epoch = 100; % Abnormal training images = {10%, 25%, 50%, 100%}
S | SupIC [27] | % Abnormal training images = {10%, 25%, 50%, 100%}; Batch size = 48; # steps/epoch = 20; # epochs = 30; LR = 2 × 10−4; Weight decay = 1 × 10−5; LR scheduler step = 10; LR scheduler γ = 0.1
S | SupIS [40] | % Abnormal training images = {10%, 25%, 50%, 100%}; Batch size = 16; # steps = 12,800; Cross-entropy loss weight = 1; Soft dice loss weight = 1; LR = 1 × 10−4; LR scheduler step = 128; LR scheduler period = 100
U: unsupervised; SS: semi-supervised; WS: weak/open-set supervised; S: supervised; IR: image reconstruction; FM: feature modeling; AB: attention-based; S-S: self-supervised; OC: one-class. LR: learning rate.

References

1. Prasad, S.K.; Pennell, D.J. Clinical role of advanced imaging in cardiology. Dialogues Cardiovasc. Med. 2007, 12, 87–101.
2. Salerno, M.; Sharif, B.; Arheden, H.; Kumar, A.; Axel, L.; Li, D.; Neubauer, S. Recent Advances in Cardiovascular Magnetic Resonance: Techniques and Applications. Circ. Cardiovasc. Imaging 2017, 10, e003951.
3. Kramer, C.M.; Barkhausen, J.; Bucciarelli-Ducci, C.; Flamm, S.D.; Kim, R.J.; Nagel, E. Standardized cardiovascular magnetic resonance imaging (CMR) protocols: 2020 update. J. Cardiovasc. Magn. Reson. 2020, 22, 17.
4. Mantini, C.; Mastrodicasa, D.; Bianco, F.; Bucciarelli, V.; Scarano, M.; Mannetta, G.; Gabrielli, D.; Gallina, S.; Petersen, S.E.; Ricci, F.; et al. Prevalence and Clinical Relevance of Extracardiac Findings in Cardiovascular Magnetic Resonance Imaging. J. Thorac. Imaging 2019, 34, 48–55.
5. Wickremasinghe, D.H.; Khenkina, N.; Masci, P.G.; King, A.P.; Puyol-Antón, E. Automatic Detection of Extra-Cardiac Findings in Cardiovascular Magnetic Resonance. In Proceedings of the Medical Image Understanding and Analysis, Oxford, UK, 12–14 July 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 98–107.
6. Chaosuwannakit, N.; Makarawate, P. Prevalence and clinical significance of incidental extracardiac findings in cardiac magnetic resonance imaging. Kardiochirurgia i Torakochirurgia Pol. J. Thorac. Cardiovasc. Surg. 2018, 15, 241–245.
7. Rahman, H.; Shawky, A.; Mehana, E. Prevalence and Clinical Significance of Incidental Extracardiac Findings during Cardiac Magnetic Resonance Imaging: A Retrospective Study. Hong Kong J. Radiol. 2023, 26, 241–245.
8. Trussell, T.M.; Kocaoglu, M.; Fleck, R.J.; Taylor, M.D.; Zang, H.; Ollberding, N.J.; Lang, S.M. Extracardiac Findings on Cardiac Magnetic Resonance: A Children’s Hospital Experience. Pediatr. Cardiol. 2023, 44, 1201–1208.
9. Ufuk, F.; Yavaş, H.; Sağtaş, E.; Kılıç, I.D. The prevalence and clinical significance of incidental non-cardiac findings on cardiac magnetic resonance imaging and unreported rates of these findings in official radiology reports. Pol. J. Radiol. 2022, 87, 207–214.
10. Chalapathy, R.; Chawla, S. Deep Learning for Anomaly Detection: A Survey. arXiv 2019, arXiv:1901.03407.
11. Pang, G.; Shen, C.; Cao, L.; Hengel, A.V.D. Deep Learning for Anomaly Detection: A Review. ACM Comput. Surv. (CSUR) 2021, 54, 1–38.
12. Baur, C.; Denner, S.; Wiestler, B.; Navab, N.; Albarqouni, S. Autoencoders for unsupervised anomaly segmentation in brain MR images: A comparative study. Med. Image Anal. 2021, 69, 101952.
13. Behrendt, F.; Bhattacharya, D.; Krüger, J.; Opfer, R.; Schlaefer, A. Patched Diffusion Models for Unsupervised Anomaly Detection in Brain MRI. In Proceedings of the Medical Imaging with Deep Learning, PMLR, Paris, France, 3–5 July 2024; pp. 1019–1032.
14. Chen, X.; You, S.; Tezcan, K.C.; Konukoglu, E. Unsupervised lesion detection via image restoration with a normative prior. Med. Image Anal. 2020, 64, 101713.
15. Ghorbel, A.; Aldahdooh, A.; Albarqouni, S.; Hamidouche, W. Transformer based models for unsupervised anomaly segmentation in brain MR images. In Proceedings of the International MICCAI Brainlesion Workshop, Singapore, 18 September 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 25–44.
16. Kascenas, A.; Pugeault, N.; O’Neil, A.Q. Denoising autoencoders for unsupervised anomaly detection in brain MRI. In Proceedings of the International Conference on Medical Imaging with Deep Learning, PMLR, Zurich, Switzerland, 6–8 July 2022; pp. 653–664.
17. Kascenas, A.; Sanchez, P.; Schrempf, P.; Wang, C.; Clackett, W.; Mikhael, S.S.; Voisey, J.P.; Goatman, K.; Weir, A.; Pugeault, N.; et al. The role of noise in denoising models for anomaly detection in medical images. Med. Image Anal. 2023, 90, 102963.
18. Lagogiannis, I.; Meissen, F.; Kaissis, G.; Rueckert, D. Unsupervised pathology detection: A deep dive into the state of the art. arXiv 2023, arXiv:2303.00609.
19. Meissen, F.; Paetzold, J.; Kaissis, G.; Rueckert, D. Unsupervised anomaly localization with structural feature-autoencoders. In Proceedings of the International MICCAI Brainlesion Workshop, Singapore, 18 September 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 14–24.
20. Silva-Rodríguez, J.; Naranjo, V.; Dolz, J. Constrained unsupervised anomaly segmentation. Med. Image Anal. 2022, 80, 102526.
21. Tan, J.; Hou, B.; Batten, J.; Qiu, H.; Kainz, B. Detecting outliers with foreign patch interpolation. arXiv 2020, arXiv:2011.04197.
22. Wyatt, J.; Leach, A.; Schmon, S.M.; Willcocks, C.G. AnoDDPM: Anomaly Detection with Denoising Diffusion Probabilistic Models Using Simplex Noise. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, New Orleans, LA, USA, 21–24 June 2022; pp. 650–656.
23. Cai, Y.; Chen, H.; Yang, X.; Zhou, Y.; Cheng, K.T. Dual-distribution discrepancy with self-supervised refinement for anomaly detection in medical images. Med. Image Anal. 2023, 86, 102794.
24. Tan, J.; Hou, B.; Day, T.; Simpson, J.; Rueckert, D.; Kainz, B. Detecting outliers with Poisson image interpolation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2021, Strasbourg, France, 27 September–1 October 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 581–591.
25. Defard, T.; Setkov, A.; Loesch, A.; Audigier, R. PaDiM: A Patch Distribution Modeling Framework for Anomaly Detection and Localization. In Proceedings of the International Conference on Pattern Recognition, Montréal, QC, Canada, 20–24 August 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 475–489.
26. Deng, H.; Li, X. Anomaly Detection via Reverse Distillation from One-Class Embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 21–24 June 2022; pp. 9737–9746.
27. Ding, C.; Pang, G.; Shen, C. Catching Both Gray and Black Swans: Open-Set Supervised Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 21–24 June 2022; pp. 7388–7398.
28. Gudovskiy, D.; Ishizaka, S.; Kozuka, K. CFLOW-AD: Real-Time Unsupervised Anomaly Detection with Localization via Conditional Normalizing Flows. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2022; pp. 98–107.
29. Guo, J.; Lu, S.; Jia, L.; Zhang, W.; Li, H. ReContrast: Domain-Specific Anomaly Detection via Contrastive Reconstruction. Adv. Neural Inf. Process. Syst. 2024, 36, 10721–10740.
30. Liu, W.; Li, R.; Zheng, M.; Karanam, S.; Wu, Z.; Bhanu, B.; Radke, R.J.; Camps, O. Towards Visually Explaining Variational Autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 8642–8651.
31. Shi, Y.; Yang, J.; Qi, Z. Unsupervised anomaly segmentation via deep feature reconstruction. Neurocomputing 2021, 424, 9–22.
32. Sohn, K.; Li, C.L.; Yoon, J.; Pfister, T.J. Self-Supervised Learning for Anomaly Detection and Localization. US Patent 11,941,084, 26 March 2024.
33. Yao, X.; Li, R.; Zhang, J.; Sun, J.; Zhang, C. Explicit Boundary Guided Semi-Push-Pull Contrastive Learning for Supervised Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 24490–24499.
34. Tian, Y.; Pang, G.; Chen, Y.; Singh, R.; Verjans, J.W.; Carneiro, G. Weakly-Supervised Video Anomaly Detection with Robust Temporal Feature Magnitude Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 4975–4986.
35. Ruff, L.; Vandermeulen, R.A.; Görnitz, N.; Binder, A.; Müller, E.; Müller, K.R.; Kloft, M. Deep Semi-Supervised Anomaly Detection. arXiv 2019, arXiv:1906.02694.
36. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114.
37. Schlegl, T.; Seeböck, P.; Waldstein, S.M.; Schmidt-Erfurth, U.; Langs, G. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In Proceedings of the International Conference on Information Processing in Medical Imaging, Boone, NC, USA, 25–30 June 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 146–157.
38. Schlegl, T.; Seeböck, P.; Waldstein, S.M.; Langs, G.; Schmidt-Erfurth, U. f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks. Med. Image Anal. 2019, 54, 30–44.
39. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
40. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
41. Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv 2016, arXiv:1608.03983.
42. Maier-Hein, L.; Reinke, A.; Godau, P.; Tizabi, M.D.; Buettner, F.; Christodoulou, E.; Glocker, B.; Isensee, F.; Kleesiek, J.; Kozubek, M.; et al. Metrics reloaded: Recommendations for image analysis validation. Nat. Methods 2024, 21, 195–212.
43. García, S.; Herrera, F. Design of Experiments in Computational Intelligence: On the Use of Statistical Inference. In Proceedings of the Hybrid Artificial Intelligence Systems: Third International Workshop, HAIS 2008, Burgos, Spain, 24–26 September 2008; Springer: Berlin/Heidelberg, Germany, 2008; pp. 4–14.
44. García, S.; Fernández, A.; Luengo, J.; Herrera, F. A study of statistical techniques and performance measures for genetics-based machine learning: Accuracy and interpretability. Soft Comput. 2009, 13, 959–977.
45. Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 1–30.
Figure 1. Categories of AD methods and their respective training image setups. Adapted from [23].
Figure 2. Frequency of each ECF subtype relative to the total number of images in each subdataset.
Figure 3. Three sample test images from each subdataset: coronal (C), sagittal (S), and axial (A), and their respective GT.
Figure 4. Reconstructions from image reconstruction/restoration methods for the sample images shown in Figure 3.
Figure 5. Anomaly maps generated by image reconstruction methods for the samples shown in Figure 3.
Figure 6. Anomaly maps generated by feature modeling methods for the samples shown in Figure 3.
Figure 7. Anomaly maps generated by attention-based and self-supervised methods for the samples shown in Figure 3.
Figure 8. Anomaly maps generated by one semi-supervised, one open-set supervised, and one fully supervised method for the samples shown in Figure 3.
Table 1. CMR dataset statistics at the patient, sequence, and image levels.
View | Patient Level | | | Sequence Level | | | Image Level | |
 | # | %Normal | %Abnormal | # | %Normal | %Abnormal | # | %Normal | %Abnormal
Coronal | - | - | - | 690 | 61.59 | 38.41 | 11,361 | 90.27 | 9.73
Axial | - | - | - | 690 | 62.46 | 37.54 | 12,086 | 87.78 | 12.22
Sagittal | - | - | - | 691 | 63.24 | 36.76 | 11,624 | 90.80 | 9.20
All | 691 | 49.64 | 50.36 | 2071 | 62.43 | 37.57 | 35,071 | 89.59 | 10.41
Table 2. Performance of benchmark methods on the test subset of the CMR dataset across coronal, sagittal, and axial views.
Cat. | Method | Coronal: pxAP (%) | spAUROC (%) | spAP (%) | Sagittal: pxAP (%) | spAUROC (%) | spAP (%) | Axial: pxAP (%) | spAUROC (%) | spAP (%)
U/IR | AE [23] | 3.30 ± 0.25 | 67.76 ± 5.03 | 66.57 ± 4.91 * | 3.75 ± 0.15 | 56.24 ± 5.97 * | 56.78 ± 5.45 * | 3.80 ± 0.36 | 71.93 ± 2.83 * | 72.23 ± 2.63 *
U/IR | VAE [36] | 1.30 ± 0.26 | 69.11 ± 4.79 | 68.12 ± 5.46 * | 1.79 ± 0.12 | 58.56 ± 5.83 * | 57.80 ± 4.81 * | 2.09 ± 0.30 | 72.09 ± 3.47 * | 72.45 ± 2.47 *
U/IR | r-VAE [14] | 2.46 ± 0.53 | 68.64 ± 4.65 | 69.30 ± 6.08 | 3.40 ± 0.48 | 66.83 ± 5.13 | 65.51 ± 4.33 | 3.59 ± 0.66 | 70.66 ± 6.84 * | 69.03 ± 7.41 *
U/IR | f-anoGAN [38] | 1.31 ± 0.28 | 69.81 ± 4.92 | 68.20 ± 6.42 * | 1.76 ± 0.15 | 62.71 ± 4.75 | 60.65 ± 4.03 * | 1.85 ± 0.37 | 69.45 ± 3.93 * | 68.13 ± 2.83 *
U/IR | H-TAE-S [15] | 1.53 ± 0.21 | 59.02 ± 4.88 * | 62.39 ± 4.14 * | 1.35 ± 0.11 | 53.36 ± 5.95 * | 52.77 ± 4.21 * | 1.04 ± 0.20 | 63.86 ± 3.00 * | 61.01 ± 3.30 *
U/IR | OS-DDPM [13] | 16.14 ± 2.66 | 73.12 ± 4.00 | 75.75 ± 4.89 | 16.71 ± 2.53 | 74.83 ± 4.40 | 78.49 ± 3.65 | 15.25 ± 1.90 | 78.55 ± 4.67 | 77.22 ± 5.53
U/FM | DFR [31] | 2.73 ± 0.90 | 70.11 ± 1.92 | 69.01 ± 3.58 * | 2.99 ± 0.21 | 59.56 ± 5.71 * | 60.71 ± 3.73 * | 7.53 ± 2.09 | 60.14 ± 4.85 * | 61.15 ± 4.91 *
U/FM | FAE [19] | 8.15 ± 2.45 | 71.97 ± 3.29 | 73.87 ± 4.90 | 11.40 ± 2.75 | 64.53 ± 5.73 | 64.35 ± 4.11 * | 8.87 ± 2.64 | 68.61 ± 5.73 * | 67.65 ± 5.92 *
U/FM | RD [26] | 2.94 ± 1.00 | 71.15 ± 3.24 | 72.18 ± 4.67 | 3.95 ± 0.38 | 65.39 ± 5.74 | 65.96 ± 3.90 | 8.78 ± 2.78 | 74.77 ± 4.25 | 73.29 ± 4.00 *
U/FM | ReContrast [29] | 5.63 ± 2.27 | 68.27 ± 3.09 | 69.11 ± 3.29 * | 9.98 ± 2.11 | 64.86 ± 4.39 | 67.05 ± 3.07 | 13.54 ± 4.40 | 78.96 ± 3.02 | 77.77 ± 3.15
U/FM | CFLOW-AD [28] | 1.24 ± 0.29 | 65.20 ± 3.54 * | 62.14 ± 4.79 * | 1.74 ± 0.13 | 55.98 ± 5.87 * | 57.76 ± 4.01 * | 3.58 ± 1.10 | 52.49 ± 5.03 * | 54.72 ± 4.78 *
U/FM | PaDiM [25] | 2.33 ± 0.65 | 69.47 ± 3.36 | 68.44 ± 4.94 * | 3.87 ± 0.51 | 62.11 ± 5.96 * | 62.12 ± 4.17 * | 4.88 ± 0.88 | 63.59 ± 6.25 * | 63.68 ± 5.81 *
U/AB | expVAE [30] | 0.94 ± 0.19 | 55.00 ± 3.17 * | 55.13 ± 3.46 * | 1.04 ± 0.19 | 53.59 ± 6.39 * | 55.54 ± 4.01 * | 1.24 ± 0.12 | 45.06 ± 4.57 * | 47.51 ± 3.91 *
U/AB | AMCons [20] | 0.84 ± 0.13 | 61.69 ± 4.29 * | 58.44 ± 3.43 * | 1.06 ± 0.08 | 51.93 ± 6.22 * | 51.14 ± 4.61 * | 1.09 ± 0.14 | 52.70 ± 3.20 * | 56.07 ± 4.69 *
U/S-S | DAE [16] | 9.91 ± 3.39 | 74.87 ± 4.06 | 76.14 ± 4.49 | 12.21 ± 1.57 | 74.39 ± 3.29 | 77.57 ± 2.25 | 11.27 ± 3.10 | 79.14 ± 5.01 | 80.38 ± 4.62
U/S-S | CutPaste [32] | 1.12 ± 0.23 | 42.85 ± 5.14 * | 44.55 ± 4.34 * | 1.32 ± 0.10 | 41.70 ± 7.49 * | 47.32 ± 6.50 * | 4.02 ± 0.94 | 71.27 ± 4.97 * | 72.84 ± 4.98 *
U/S-S | PII [24] | 8.95 ± 4.65 | 67.65 ± 4.41 | 68.98 ± 4.59 | 6.41 ± 2.24 | 63.29 ± 2.91 | 64.94 ± 3.20 * | 11.25 ± 5.32 | 75.41 ± 2.86 | 76.47 ± 2.57
SS/OC | DDAD [23] | 5.89 ± 0.75 | 66.60 ± 6.26 | 67.88 ± 7.18 * | 4.45 ± 0.19 | 63.58 ± 5.42 | 64.00 ± 3.22 * | 4.08 ± 0.72 | 54.66 ± 6.39 * | 54.02 ± 5.66 *
WS | BGAD [33] | 11.49 ± 3.16 | 67.72 ± 1.78 | 69.42 ± 1.86 * | 21.62 ± 4.06 | 73.44 ± 2.20 | 76.14 ± 2.35 | 25.84 ± 2.84 | 81.39 ± 2.65 | 82.74 ± 3.07
WS | DRA [27] (10%) | - | 60.93 ± 3.86 * | 63.22 ± 3.69 * | - | 63.27 ± 1.86 | 66.79 ± 2.50 | - | 64.26 ± 3.72 * | 67.43 ± 2.91 *
WS | DRA [27] (25%) | - | 66.32 ± 4.65 * | 68.25 ± 4.39 * | - | 66.91 ± 4.69 | 70.61 ± 3.97 | - | 67.75 ± 7.23 * | 72.66 ± 6.25 *
WS | DRA [27] (50%) | - | 72.66 ± 3.77 | 74.92 ± 4.13 | - | 67.96 ± 2.44 | 71.93 ± 2.29 | - | 76.14 ± 5.58 | 80.46 ± 4.39
WS | DRA [27] (100%) | - | 74.05 ± 1.55 | 74.91 ± 2.68 | - | 71.92 ± 1.94 | 75.25 ± 1.58 | - | 85.50 ± 1.56 | 87.68 ± 1.39
S | SupIC (10%) | - | 54.23 ± 4.41 * | 57.58 ± 3.01 * | - | 54.12 ± 5.18 * | 59.34 ± 4.70 * | - | 55.97 ± 6.79 * | 61.69 ± 6.85 *
S | SupIC (25%) | - | 55.28 ± 5.45 * | 60.70 ± 5.71 * | - | 63.03 ± 3.89 | 68.40 ± 3.43 | - | 69.32 ± 3.46 * | 75.06 ± 2.70
S | SupIC (50%) | - | 63.82 ± 3.33 * | 67.42 ± 1.62 * | - | 65.34 ± 3.11 | 69.25 ± 3.79 | - | 77.67 ± 3.10 | 81.05 ± 2.14
S | SupIC (100%) | - | 71.88 ± 0.98 | 74.49 ± 2.35 | - | 69.92 ± 2.22 | 74.00 ± 1.97 | - | 82.61 ± 3.61 | 84.97 ± 3.47
S | SupIS (10%) | 2.54 ± 0.74 | 64.39 ± 3.89 * | 62.93 ± 3.38 * | 3.70 ± 2.42 | 61.80 ± 5.09 * | 64.58 ± 5.56 * | 7.72 ± 1.83 | 71.36 ± 5.12 * | 73.28 ± 4.10 *
S | SupIS (25%) | 8.70 ± 3.52 | 70.62 ± 1.62 | 71.06 ± 1.79 | 16.47 ± 4.78 | 69.30 ± 6.25 | 73.32 ± 5.13 | 25.63 ± 6.22 | 76.03 ± 3.90 | 79.21 ± 3.64
S | SupIS (50%) | 12.80 ± 3.24 | 71.24 ± 3.36 | 71.81 ± 3.69 | 27.41 ± 6.83 | 72.72 ± 4.93 | 77.33 ± 3.57 | 38.58 ± 5.91 | 81.78 ± 2.98 | 84.02 ± 2.43
S | SupIS (100%) | 16.72 ± 4.96 | 70.03 ± 4.23 | 72.66 ± 4.46 | 33.75 ± 7.16 | 74.07 ± 4.68 | 77.69 ± 4.05 | 45.39 ± 6.51 | 87.42 ± 1.91 | 89.23 ± 1.43
- | Random | 1.33 ± 0.19 | 50.00 ± 0.00 | 50.00 ± 0.00 | 1.29 ± 0.09 | 50.00 ± 0.00 | 50.00 ± 0.00 | 1.38 ± 0.15 | 50.00 ± 0.00 | 50.00 ± 0.00
U: unsupervised; SS: semi-supervised; WS: weak/open-set supervised; S: supervised; IR: image reconstruction; FM: feature modeling; AB: attention-based; S-S: self-supervised; OC: one-class; px: pixel; sp: sample. For each metric, the best result is bolded, and the second-best result is underlined. DRA, SupIC, and SupIS were tested with different proportions of abnormal images in the training subset: 10%, 25%, 50%, and 100%. DRA and SupIC perform image classification only. * p < 0.05 in a multiple comparison Finner post hoc test against SupIC with 100% of abnormal training images; p < 0.05 in a multiple comparison Finner post hoc test against SupIS with 100% of abnormal training images.
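
For readers less familiar with the metrics in Table 2, the sketch below shows one possible way to compute them with scikit-learn: pixel-level average precision (pxAP) pooled over all pixels of all test images, and sample-level AUROC and AP (spAUROC, spAP) from one anomaly score per image. It is a minimal illustration, not the evaluation code used in this study; in particular, aggregating an anomaly map into an image-level score via its maximum is an assumption that may differ from the benchmark's exact procedure.

```python
# Minimal sketch of pxAP, spAUROC, and spAP computation (illustrative only).
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def pixel_ap(gt_masks: np.ndarray, anomaly_maps: np.ndarray) -> float:
    """pxAP: average precision over all pixels of all test images."""
    return average_precision_score(gt_masks.reshape(-1).astype(int),
                                   anomaly_maps.reshape(-1))

def sample_metrics(gt_labels: np.ndarray, image_scores: np.ndarray):
    """spAUROC and spAP from one anomaly score per image."""
    return (roc_auc_score(gt_labels, image_scores),
            average_precision_score(gt_labels, image_scores))

# Toy example with random anomaly maps for two 8x8 test images.
rng = np.random.default_rng(0)
maps = rng.random((2, 8, 8))
masks = np.zeros((2, 8, 8), dtype=int)
masks[1, 2:4, 2:4] = 1                      # one image contains an anomalous region
labels = masks.reshape(2, -1).max(axis=1)   # image-level labels derived from the masks
scores = maps.reshape(2, -1).max(axis=1)    # assumed aggregation: max of each anomaly map
print(pixel_ap(masks, maps), *sample_metrics(labels, scores))
```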
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
